normet

prepare_data(df, value, feature_names, na_rm=True, split_method='random', replace=False, fraction=0.75, seed=7654321)

Prepares the input DataFrame by performing data cleaning, imputation, and splitting.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • value (str, optional) – Name of the target variable. Default is ‘value’.

  • feature_names (list, optional) – List of feature names. Default is None.

  • na_rm (bool, optional) – Whether to remove missing values. Default is True.

  • split_method (str, optional) – Method for splitting data (‘random’ or ‘time_series’). Default is ‘random’.

  • replace (bool, optional) – Whether to replace existing date variables. Default is False.

  • fraction (float, optional) – Fraction of the dataset to be used for training. Default is 0.75.

  • seed (int, optional) – Seed for random operations. Default is 7654321.

Returns:

Prepared DataFrame with cleaned data and split into training and testing sets.

Return type:

pandas.DataFrame

Details:

  • Data Cleaning: Checks the input data for consistency and validity using the check_data function.

  • Imputation: Handles missing values according to the na_rm parameter using the impute_values function.

  • Date Variables: Adds or replaces date-related variables in the dataset using the add_date_variables function.

  • Data Splitting: Splits the data into training and testing sets using the split_into_sets function based on the specified split_method.

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
prepared_df = nm.prepare_data(df, value, feature_names, split_method='time_series', fraction=0.8)

Notes:

  • This function is a pipeline that sequentially applies various data preparation steps to ensure the dataset is clean and ready for modeling.

  • The split_method parameter allows flexibility in how the data is split, supporting both random and time-series based methods.

  • The seed parameter ensures reproducibility in random operations, particularly useful when split_method is ‘random’.

process_df(df, variables_col)

Processes the DataFrame to ensure it contains necessary date and selected feature columns.

This function checks if the date is present in the index or columns, selects the necessary features and the date column, and prepares the DataFrame for further analysis.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • variables_col (list of str) – List of variable names to be included in the DataFrame.

Returns:

Processed DataFrame containing the date and selected feature columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If no datetime information is found in index or ‘date’ column.

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('data.csv')
variables_col = ['feature1', 'feature2', 'feature3']
processed_df = nm.process_df(df, variables_col)
check_data(df, value)

Validates and preprocesses the input DataFrame for subsequent analysis or modeling.

This function checks if the target variable is present, ensures the date column is of the correct type, and validates there are no missing dates, returning a DataFrame with the target column renamed for consistency.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the data to be checked.

  • value (str) – Name of the target variable (column) to be used in the analysis.

Returns:

A DataFrame containing only the necessary columns, with appropriate checks and transformations applied.

Return type:

pandas.DataFrame

Raises:

ValueError

  • If the target variable (value) is not in the DataFrame columns.

  • If there is no datetime information in either the index or the ‘date’ column.

  • If the ‘date’ column is not of type datetime64.

  • If the ‘date’ column contains missing values.

Notes:
  • If the DataFrame’s index is a DatetimeIndex, it is reset to a column named ‘date’.

  • The target column (value) is renamed to ‘value’.

Example:

import pandas as pd
import normet as nm
data = {
     'timestamp': pd.date_range(start='1/1/2020', periods=5, freq='D'),
     'target': [1, 2, 3, 4, 5]
 }
df = pd.DataFrame(data).set_index('timestamp')
df_checked = nm.check_data(df, 'target')
print(df_checked)
impute_values(df, na_rm)

Imputes missing values in the DataFrame.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • na_rm (bool) – Whether to remove missing values.

Returns:

DataFrame with imputed missing values.

Return type:

pandas.DataFrame

Details:

  • Missing Values Handling: Depending on the value of na_rm, missing values can either be removed (na_rm=True) or imputed.

  • Numeric Variables: Missing values in numeric columns are filled with the median of each column.

  • Categorical Variables: Missing values in categorical columns (object or category dtype) are filled with the mode (most frequent value) of each column.

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('data.csv')
cleaned_df = nm.impute_values(df, na_rm=True)
print(cleaned_df.head())
add_date_variables(df, replace)

Adds date-related variables to the DataFrame.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • replace (bool) – Whether to replace existing date variables.

Returns:

DataFrame with added date-related variables.

Return type:

pandas.DataFrame

Details:

  • Date Variables Addition: Depending on the replace parameter, new date-related variables such as ‘date_unix’, ‘day_julian’, ‘weekday’, and ‘hour’ are added to the DataFrame.

  • Replace Existing Variables: If replace=True, existing date-related variables are overwritten with new values.

  • Non-replacement Logic: If replace=False, new date-related variables are added only if they do not already exist in the DataFrame.

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('data.csv')
enriched_df = nm.add_date_variables(df, replace=True)
print(enriched_df.head())
split_into_sets(df, split_method, fraction, seed)

Splits the DataFrame into training and testing sets based on the specified split method.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • split_method (str) – Method for splitting data (‘random’, ‘ts’, ‘season’, ‘month’).

  • fraction (float) – Fraction of the dataset to be used for training (for ‘random’, ‘ts’, ‘season’) or fraction of each month to be used for training (for ‘month’).

  • seed (int) – Seed for random operations.

Returns:

DataFrame with a ‘set’ column indicating the training or testing set.

Return type:

pandas.DataFrame

Example:

import pandas as pd
import normet as nm
data = {
     'date': pd.date_range(start='2020-01-01', periods=365),
     'value': range(365)
 }
df = pd.DataFrame(data)
df_split = nm.split_into_sets(df, split_method='season', fraction=0.8, seed=12345)

Notes:

  • Depending on the split_method:
    • ‘random’: Randomly splits the data into training and testing sets.

    • ‘ts’: Splits the data based on a fraction of the total length.

    • ‘season’: Splits the data into seasonal sets based on the month of the year.

    • ‘month’: Splits the data into monthly sets.

  • Each resulting DataFrame will have a ‘set’ column indicating whether the row belongs to the ‘training’ or ‘testing’ set.

train_model(df, value='value', variables=None, model_config=None, seed=7654321, verbose=True)

Trains a machine learning model using the provided dataset and parameters.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • value (str, optional) – Name of the target variable. Default is ‘value’.

  • variables (list of str) – List of feature variables. Default is None.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • seed (int, optional) – Random seed for reproducibility. Default is 7654321.

  • verbose (bool, optional) – If True, print progress messages. Default is True.

Returns:

Trained ML model object.

Return type:

object

Raises:

ValueError – If variables contains duplicates or if any variables are not present in the DataFrame.

Example:

import pandas as pd
import normet as nm
data = {
     'feature1': [1, 2, 3, 4, 5],
     'feature2': [5, 4, 3, 2, 1],
     'target': [10, 20, 30, 40, 50],
     'set': ['training', 'training', 'training', 'validation', 'validation']
 }
df = pd.DataFrame(data)
model = nm.train_model(df, value='target', variables=['feature1', 'feature2'])

Notes:

  • If the ‘set’ column is present in the DataFrame, only rows where set is ‘training’ are used for training.

  • The default model_config includes:

model_config = {
'time_budget': 90,                     # Total running time in seconds
'metric': 'r2',                        # Primary metric for regression, 'mae', 'mse', 'r2', 'mape',...
'estimator_list': ["lgbm"],            # List of ML learners: ["lgbm", "rf", "xgboost", "extra_tree", "xgb_limitdepth"]
'task': 'regression',                  # Task type
'verbose': verbose                     # Print progress messages
}
  • This configuration can be updated with user-provided model_config.

prepare_train_model(df, value, feature_names, split_method, fraction, model_config, seed, verbose=True)

Prepares the data and trains a machine learning model using the specified configuration.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame containing the data to be used for training.

  • value (str) – The name of the target variable to be predicted.

  • feature_names (list of str) – A list of feature column names to be used in the training.

  • split_method (str) – The method to split the data (‘random’ or other supported methods).

  • fraction (float) – The fraction of data to be used for training.

  • model_config (dict) – The configuration dictionary for the AutoML model training.

  • seed (int) – The random seed for reproducibility.

  • verbose (bool, optional) – If True, print progress messages. Default is True.

Returns:

A tuple containing: - pd.DataFrame: The prepared DataFrame ready for model training. - object: The trained machine learning model.

Return type:

tuple

Raises:

ValueError – If there are any issues with the data preparation or model training.

Example:

import pandas as pd
import normet as nm
data = {
     'feature1': [1, 2, 3, 4, 5],
     'feature2': [5, 4, 3, 2, 1],
     'target': [2, 3, 4, 5, 6],
     'set': ['training', 'training', 'training', 'testing', 'testing']
 }
df = pd.DataFrame(data)
feature_names = ['feature1', 'feature2']
split_method = 'random'
fraction = 0.75
model_config = {'time_budget': 90, 'metric': 'r2'}
seed = 7654321
df_prepared, model = nm.prepare_train_model(df, value='target', feature_names=feature_names, split_method=split_method, fraction=fraction, model_config=model_config, seed=seed, verbose=True)

Notes:

  • The prepare_data function is called to preprocess and split the data based on the given split_method and fraction.

  • The train_model function is then used to train the model using the prepared data and specified model_config.

  • The default model_config includes:

model_config = {
'time_budget': 90,                     # Total running time in seconds
'metric': 'r2',                        # Primary metric for regression, 'mae', 'mse', 'r2', 'mape',...
'estimator_list': ["lgbm"],            # List of ML learners: "lgbm", "rf", "xgboost", "extra_tree", "xgb_limitdepth"
'task': 'regression',                  # Task type
'verbose': verbose                     # Print progress messages
}
  • The configuration for ML can be updated with user-provided model_config.

  • Any columns named ‘date_unix’, ‘day_julian’, ‘weekday’, or ‘hour’ are excluded from the feature variables before preparing the data.

normalise_worker(index, df, model, variables_resample, replace, seed, verbose, weather_df=None)

Worker function for parallel normalisation of data using randomly resampled meteorological parameters from another weather DataFrame within its date range. If no weather DataFrame is provided, it defaults to using the input DataFrame.

Parameters:
  • index (int) – Index of the worker.

  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • model (object) – Trained ML model.

  • variables_resample (list of str) – List of resampling variables.

  • replace (bool) – Whether to sample with replacement.

  • seed (int) – Random seed.

  • verbose (bool) – Whether to print progress messages.

  • weather_df (pandas.DataFrame, optional) – Weather DataFrame containing the meteorological parameters. Defaults to None.

Returns:

DataFrame containing normalised predictions.

Return type:

pandas.DataFrame

Example:

import pandas as pd
import normet as nm
data = {
     'date': pd.date_range(start='2020-01-01', periods=365),
     'value': range(365),
     'temp': np.random.rand(365),
     'humidity': np.random.rand(365)
 }
weather_data = {
     'temp': np.random.rand(100),
     'humidity': np.random.rand(100)
 }
df = pd.DataFrame(data)
weather_df = pd.DataFrame(weather_data)
model = nm.trained_model  # Assuming a trained model is available
predictions = nm.normalise_worker(
     index=0,
     df=df,
     model=model,
     variables_resample=['temp', 'humidity'],
     replace=True,
     seed=42,
     verbose=True,
     weather_df=weather_df
 )
print(predictions)

Notes:

  • Progress messages are printed every fifth prediction if verbose is set to True.

  • Meteorological parameters are resampled either from the provided weather_df or the input df if weather_df is not provided.

  • The function returns a DataFrame with the original date, observed values, normalised predictions, and the seed used for random sampling.

normalise(df, model, feature_names, variables_resample=None, n_samples=300, replace=True, aggregate=True, seed=7654321, n_cores=None, weather_df=None, verbose=True)

Normalises the dataset using a trained machine learning model and optionally resamples meteorological parameters from a provided weather DataFrame.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • model (object) – Trained ML model.

  • feature_names (list of str) – List of feature names.

  • variables_resample (list of str, optional) – List of resampling variables. Default is None.

  • n_samples (int, optional) – Number of samples to normalise. Default is 300.

  • replace (bool, optional) – Whether to replace existing data. Default is True.

  • aggregate (bool, optional) – Whether to aggregate results. Default is True.

  • seed (int, optional) – Random seed. Default is 7654321.

  • n_cores (int, optional) – Number of CPU cores to use. Default is total CPU cores minus one.

  • weather_df (pandas.DataFrame, optional) – DataFrame containing weather data for resampling. Default is None.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

DataFrame containing normalised predictions.

Return type:

pandas.DataFrame

Example:

import pandas as pd
import normet as nm
data = {
     'date': pd.date_range(start='2020-01-01', periods=5, freq='D'),
     'feature1': [1, 2, 3, 4, 5],
     'feature2': [5, 4, 3, 2, 1],
     'value': [2, 3, 4, 5, 6]
 }
df = pd.DataFrame(data)
feature_names = ['feature1', 'feature2']
model = nm.train_model(df, value='value', variables=feature_names)
variables_resample = ['feature1', 'feature2']
normalised_df = nm.normalise(df, model, feature_names, variables_resample)

Notes:

  • The function can optionally use a separate weather DataFrame for resampling meteorological parameters.

  • Progress messages are printed if verbose is set to True.

  • The number of CPU cores used for parallel processing can be specified, or defaults to the total number of cores minus one.

  • If aggregate is True, the results are averaged; otherwise, the function returns all individual predictions.

do_all(df=None, model=None, value=None, feature_names=None, variables_resample=None, split_method='random', fraction=0.75, model_config=None, n_samples=300, seed=7654321, n_cores=None, aggregate=True, weather_df=None, verbose=True)

Conducts data preparation, model training, and normalisation, returning the transformed dataset and model statistics.

This function performs the entire pipeline from data preparation to model training and normalisation using specified parameters and returns the transformed dataset along with model statistics.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • model (object, optional) – Pre-trained model to use for decomposition. If None, a new model will be trained. Default is None.

  • value (str) – Name of the target variable.

  • feature_names (list of str) – List of feature names.

  • variables_resample (list of str) – List of variables for normalisation.

  • split_method (str, optional) – Method for splitting data (‘random’ or ‘time_series’). Default is ‘random’.

  • fraction (float, optional) – Fraction of the dataset to be used for training. Default is 0.75.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • n_samples (int, optional) – Number of samples for normalisation. Default is 300.

  • seed (int, optional) – Seed for random operations. Default is 7654321.

  • n_cores (int, optional) – Number of CPU cores to be used for normalisation. Default is total CPU cores minus one.

  • weather_df (pandas.DataFrame, optional) – DataFrame containing weather data for resampling. Default is None.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

Transformed dataset with normalised values and DataFrame containing model statistics.

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
variables_resample = ['feature1', 'feature2']
df_dew, mod_stats = nm.do_all(df, value=value, feature_names=feature_names, variables_resample=variables_resample)

Notes:

  • If a model is not provided, the function will train a new model using the specified parameters.

  • Model statistics are collected for testing, training, and the entire dataset.

  • The function uses the specified number of CPU cores for normalisation, defaulting to one less than the total number of cores.

  • If a weather DataFrame is provided, it is used for resampling meteorological parameters; otherwise, the input DataFrame is used.

  • Progress messages are printed if verbose is set to True.

do_all_unc(df=None, value=None, feature_names=None, variables_resample=None, split_method='random', fraction=0.75, model_config=None, n_samples=300, n_models=10, confidence_level=0.95, seed=7654321, n_cores=None, weather_df=None, verbose=True)

Performs uncertainty quantification by training multiple models with different random seeds and calculates statistical metrics.

Parameters:
  • df (pandas.DataFrame) – Input dataframe containing the time series data.

  • value (str) – Column name of the target variable.

  • feature_names (list of str) – List of feature column names.

  • variables_resample (list of str) – List of sampled feature names for normalisation.

  • split_method (str, optional) – Method to split the data (‘random’ or other methods). Default is ‘random’.

  • fraction (float, optional) – Fraction of data to be used for training. Default is 0.75.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • n_samples (int, optional) – Number of samples for normalisation. Default is 300.

  • n_models (int, optional) – Number of models to train for uncertainty quantification. Default is 10.

  • confidence_level (float, optional) – Confidence level for the uncertainty bounds. Default is 0.95.

  • seed (int, optional) – Random seed for reproducibility. Default is 7654321.

  • n_cores (int, optional) – Number of cores to be used. Default is total CPU cores minus one.

  • weather_df (pandas.DataFrame, optional) – DataFrame containing weather data for resampling. Default is None.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

A tuple containing a DataFrame with normalised values and a DataFrame with model statistics.

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
variables_resample = ['feature1', 'feature2']
df_dew, mod_stats = nm.do_all_unc(df, value=value, feature_names=feature_names, variables_resample=variables_resample)

Notes:

  • Multiple models are trained using different random seeds to quantify uncertainty.

  • If verbose is True, progress messages are printed.

  • normalisation is performed using the specified number of CPU cores, with the default being the total number of cores minus one.

  • If a weather DataFrame is provided, it is used for resampling meteorological parameters; otherwise, the input DataFrame is used.

decom_emi(df=None, model=None, value=None, feature_names=None, split_method='random', fraction=0.75, model_config=None, n_samples=300, seed=7654321, n_cores=None, verbose=True)

Decomposes a time series into different components using machine learning models.

This function prepares the data, trains a machine learning model using AutoML, and decomposes the time series data into various components. The decomposition is based on the contribution of different features to the target variable. It returns the decomposed data and model statistics.

Parameters:
  • df (pandas.DataFrame) – Input dataframe containing the time series data.

  • model (object, optional) – Pre-trained model to use for decomposition. If None, a new model will be trained. Default is None.

  • value (str) – Column name of the target variable.

  • feature_names (list of str) – List of feature column names.

  • split_method (str, optional) – Method to split the data (‘random’ or other methods). Default is ‘random’.

  • fraction (float, optional) – Fraction of data to be used for training. Default is 0.75.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • n_samples (int, optional) – Number of samples for normalisation. Default is 300.

  • seed (int, optional) – Random seed for reproducibility. Default is 7654321.

  • n_cores (int, optional) – Number of cores to be used. Default is total CPU cores minus one.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

A tuple containing a dataframe with decomposed components and a dataframe with model statistics.

Return type:

tuple (pd.DataFrame, pd.DataFrame)

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
df_dewc, mod_stats = nm.decom_emi(df, value, feature_names)

Details:

  • If no pre-trained model is provided, the function will prepare the data and train a new model using AutoML.

  • The function gathers model statistics for testing, training, and the entire dataset.

  • The time series is decomposed by excluding different features iteratively.

  • The decomposed components are adjusted to create deweathered values.

  • The results include the decomposed dataframe and model statistics for further analysis.

decom_met(df=None, model=None, value=None, feature_names=None, split_method='random', fraction=0.75, model_config=None, n_samples=300, seed=7654321, importance_ascending=False, n_cores=None, verbose=True)

Decomposes a time series into different components using machine learning models with feature importance ranking.

This function prepares the data, trains a machine learning model using AutoML, and decomposes the time series data into various components. The decomposition is based on the feature importance ranking and their contributions to the target variable. It returns the decomposed data and model statistics.

Parameters:
  • df (pandas.DataFrame) – Input dataframe containing the time series data.

  • model (object, optional) – Pre-trained model to use for decomposition. If None, a new model will be trained. Default is None.

  • value (str) – Column name of the target variable.

  • feature_names (list of str) – List of feature column names.

  • split_method (str, optional) – Method to split the data (‘random’ or other methods). Default is ‘random’.

  • fraction (float, optional) – Fraction of data to be used for training. Default is 0.75.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • n_samples (int, optional) – Number of samples for normalisation. Default is 300.

  • seed (int, optional) – Random seed for reproducibility. Default is 7654321.

  • importance_ascending (bool, optional) – Sort order for feature importances. Default is False.

  • n_cores (int, optional) – Number of cores to be used. Default is total CPU cores minus one.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

A dataframe with decomposed components and a dataframe with model statistics.

Return type:

tuple (pd.DataFrame, pd.DataFrame)

Example:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
df_dewwc, mod_stats = nm.decom_met(df, value, feature_names)

Details:

  • If no pre-trained model is provided, the function will prepare the data and train a new model using AutoML.

  • The function gathers model statistics for testing, training, and the entire dataset.

  • Feature importances are determined and sorted based on their contribution to the target variable.

  • The time series is decomposed by excluding different features iteratively, according to their importance.

  • The decomposed components are adjusted to create weather-independent values.

  • The results include the decomposed dataframe and model statistics for further analysis.

rolling_dew(df=None, model=None, value=None, feature_names=None, split_method='random', fraction=0.75, model_config=None, n_samples=300, window_days=14, rolling_every=7, seed=7654321, n_cores=None, verbose=True)

Applies a rolling window approach to decompose the time series into different components using machine learning models.

This function prepares the data, trains a machine learning model using AutoML, and applies a rolling window approach to decompose the time series data into various components. The decomposition is based on the contribution of different features to the target variable. It returns the decomposed data and model statistics.

Parameters:
  • df (pandas.DataFrame) – Input dataframe containing the time series data.

  • model (object, optional) – Pre-trained model to use for decomposition. If None, a new model will be trained. Default is None.

  • value (str) – Column name of the target variable.

  • feature_names (list of str) – List of feature column names.

  • split_method (str, optional) – Method to split the data (‘random’ or other methods). Default is ‘random’.

  • fraction (float, optional) – Fraction of data to be used for training. Default is 0.75.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

  • n_samples (int, optional) – Number of samples for normalisation. Default is 300.

  • window_days (int, optional) – Number of days for the rolling window. Default is 14.

  • rolling_every (int, optional) – Rolling interval in days. Default is 7.

  • seed (int, optional) – Random seed for reproducibility. Default is 7654321.

  • n_cores (int, optional) – Number of cores to be used. Default is total CPU cores minus one.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

Returns:

Tuple containing: - df_dew (pd.DataFrame): Dataframe with decomposed components including mean and standard deviation of the rolling window. - mod_stats (pd.DataFrame): Dataframe with model statistics.

Details:

  • Data Preparation: Prepares the input data for modeling and optionally trains a new model using AutoML.

  • Model Training: Trains or uses the provided model to learn the relationship between features and the target variable.

  • Rolling Window Decomposition: Applies a rolling window approach to decompose the time series into components over specified windows and intervals.

  • Feature normalisation: Normalises the data within each rolling window using normalise function.

  • Component Calculation: Calculates mean and standard deviation of the rolling window to derive short-term and seasonal components.

  • Returns decomposed data (df_dew) including observed, short-term, seasonal components, and statistics (mod_stats) for evaluation.

Example:

  • Useful for analyzing time series data with varying patterns over time and decomposing it into interpretable components.

  • Supports dynamic assessment of feature contributions to the target variable across different rolling windows.

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
value = 'target'
feature_names = ['feature1', 'feature2', 'feature3']
df_dew, mod_stats = nm.rolling_dew(df, value, feature_names, window_days=14, rolling_every=2)

Notes:

  • Enhances understanding of time series data by breaking down its components over sliding windows.

  • Facilitates evaluation of model performance and feature relevance across different temporal contexts.

modStats(df, model, set=None, statistic=None)

Calculates statistics for model evaluation based on provided data.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • model (object) – Trained ML model.

  • set (str, optional) – Set type for which statistics are calculated (‘training’, ‘testing’, or ‘all’). Default is None.

  • statistic (list of str, optional) – List of statistics to calculate. Default is [“n”, “FAC2”, “MB”, “MGE”, “NMB”, “NMGE”, “RMSE”, “r”, “COE”, “IOA”, “R2”].

Returns:

DataFrame containing calculated statistics.

Return type:

pandas.DataFrame

Example:

Calculates statistics for a trained model on testing dataset:

import pandas as pd
import normet as nm
df = pd.read_csv('timeseries_data.csv')
model = nm.train_model(df, 'target', feature_names)
stats = nm.modStats(df, model, set='testing')

Notes:

  • If set parameter is provided, the function filters the DataFrame df to include only rows where the ‘set’ column matches set.

  • Raises a ValueError if set parameter is provided but ‘set’ column is not present in df.

  • Calculates statistics such as ‘n’, ‘FAC2’, ‘MB’, ‘MGE’, ‘NMB’, ‘NMGE’, ‘RMSE’, ‘r’, ‘COE’, ‘IOA’, ‘R2’ based on model predictions (‘value_predict’) and observed values (‘value’) in the DataFrame.

Stats(df, mod, obs, statistic=None)

Calculates specified statistics based on provided data.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • mod (str) – Column name of the model predictions.

  • obs (str) – Column name of the observed values.

  • statistic (list of str, optional) – List of statistics to calculate. Default is [“n”, “FAC2”, “MB”, “MGE”, “NMB”, “NMGE”, “RMSE”, “r”, “COE”, “IOA”, “R2”].

Returns:

DataFrame containing calculated statistics.

Return type:

pandas.DataFrame

Details:

This function calculates a range of statistical metrics to evaluate the model predictions against the observed values. The following statistics can be calculated:

  • n: Number of observations.

  • FAC2: Factor of 2.

  • MB: Mean Bias.

  • MGE: Mean Gross Error.

  • NMB: Normalised Mean Bias.

  • NMGE: Normalised Mean Gross Error.

  • RMSE: Root Mean Square Error.

  • r: Pearson correlation coefficient.

  • COE: Coefficient of Efficiency.

  • IOA: Index of Agreement.

  • R2: Coefficient of Determination (R-squared).

The significance level of the correlation coefficient (p-value) is also evaluated and indicated with symbols:

  • “” : p >= 0.1 (not significant)

  • “+” : 0.1 > p >= 0.05 (marginally significant)

  • “*” : 0.05 > p >= 0.01 (significant)

  • “**” : 0.01 > p >= 0.001 (highly significant)

  • “***” : p < 0.001 (very highly significant)

Example:

import pandas as pd
import normet as nm
data = {
         'observed': [1, 2, 3, 4, 5],
         'predicted': [1.1, 1.9, 3.2, 3.8, 5.1]
 }
df = pd.DataFrame(data)
stats = nm.Stats(df, mod='predicted', obs='observed')
print(stats)

Notes:

  • Each statistical metric has a specific function that calculates its value.

  • The function returns a DataFrame with the calculated statistics.

  • Significance levels for the correlation coefficient are marked with appropriate symbols.

pdp(df, model, variables=None, training_only=True, n_cores=None)

Computes partial dependence plots for all specified features.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • model – AutoML model object.

  • variables (list, optional) – List of variables to compute partial dependence plots for. If None, defaults to feature_names.

  • training_only (bool, optional) – If True, computes partial dependence plots only for the training set. Default is True.

  • n_cores (int, optional) – Number of CPU cores to use. Default is total CPU cores minus one.

Returns:

DataFrame containing the computed partial dependence plots for all specified features.

Return type:

pandas.DataFrame

Example:

import pandas as pd
from flaml import AutoML
import normet as nm  # Replace with the actual module name if different

# Load dataset
df = pd.read_csv('path_to_your_dataset.csv')

# Initialize AutoML model
automl = AutoML()

# Fit the model (assuming the model has a fit method)
automl.fit(df.drop(columns=['target']), df['target'])

# Compute Partial Dependence Plots for specific features
df_predict = nm.pdp(df, automl, variables=['feature1', 'feature2', 'feature3'])

# Display the resulting DataFrame
print(df_predict)
pdp_worker(X_train, model, variable, training_only=True)

Worker function for computing partial dependence plots for a single feature.

Parameters:
  • model – AutoML model object.

  • X_train (pandas.DataFrame) – Input DataFrame containing the training data.

  • variable (str) – Name of the feature to compute partial dependence plot for.

  • training_only (bool, optional) – If True, computes partial dependence plot only for the training set. Default is True.

Returns:

DataFrame containing the computed partial dependence plot for the specified feature.

Return type:

pandas.DataFrame

Example:

import pandas as pd
from flaml import AutoML
import normet as nm

# Load training data
X_train = pd.read_csv('path_to_your_training_data.csv')

# Initialize AutoML model
automl = AutoML()

# Fit the model (assuming the model has a fit method)
automl.fit(X_train.drop(columns=['target']), X_train['target'])

# Compute Partial Dependence Plot for a single feature
df_predict = nm.pdp_worker(X_train, automl, variable='feature1')

# Display the resulting DataFrame
print(df_predict)
scm_all(df, poll_col, code_col, control_pool, cutoff_date, n_cores=None)

Performs Synthetic Control Method (SCM) in parallel for multiple treatment targets.

Parameters:
  • df (str) – Input DataFrame containing the dataset.

  • poll_col – Name of the column containing the poll data.

  • code_col (str) – Name of the column containing the code data.

  • control_pool (list) – List of control pool codes.

  • cutoff_date (str) – Date for splitting pre- and post-treatment datasets.

  • n_cores (int, optional) – Number of CPU cores to use. Default is total CPU cores minus one.

Returns:

DataFrame containing synthetic control results for all treatment targets.

Return type:

pandas.DataFrame

Example:

import pandas as pd
import normet as nm

# Example data
df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01', periods=100, freq='D'),
    'pollutant': np.random.randn(100),
    'unit_code': ['A'] * 25 + ['B'] * 25 + ['C'] * 25 + ['D'] * 25
})

# Perform SCM in parallel for multiple treatment targets
synthetic_all = nm.scm_all(df, poll_col='pollutant', code_col='unit_code', control_pool=['B', 'C', 'D'], cutoff_date='2020-01-01', n_cores=4)
scm(df, poll_col, code_col, treat_target, control_pool, cutoff_date)

Performs Synthetic Control Method (SCM) for a single treatment target.

Parameters:
  • df (str) – Input DataFrame containing the dataset.

  • poll_col – Name of the column containing the poll data.

  • code_col (str) – Name of the column containing the code data.

  • treat_target (str) – Code of the treatment target.

  • control_pool (list) – List of control pool codes.

  • cutoff_date (str) – Date for splitting pre- and post-treatment datasets.

Returns:

DataFrame containing synthetic control results for the specified treatment target.

Return type:

pandas.DataFrame

Example:

import pandas as pd
import normet as nm

# Example data
df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01', periods=100, freq='D'),
    'pollutant': np.random.randn(100),
    'unit_code': ['A'] * 25 + ['B'] * 25 + ['C'] * 25 + ['D'] * 25
})

# Perform SCM for a single treatment target
synthetic_data = nm.scm(df, poll_col='pollutant', code_col='unit_code', treat_target='A', control_pool=['B', 'C', 'D'], cutoff_date='2020-01-01')
mlsc(df, poll_col, date_col, code_col, treat_target, control_pool, cutoff_date, model_config)

Performs synthetic control using machine learning regression models.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • poll_col (str) – Name of the column containing the poll data.

  • date_col (str) – Name of the column containing the date data.

  • code_col (str) – Name of the column containing the code data.

  • treat_target (str) – Code of the treatment target.

  • control_pool (list) – List of control pool codes.

  • cutoff_date (str) – Date for splitting pre- and post-treatment datasets.

  • model_config (dict, optional) – Configuration dictionary for model training parameters.

Returns:

DataFrame containing synthetic control results for the specified treatment target.

Return type:

pandas.DataFrame

Example:

import normet as nm
synthetic_result = nm.mlsc(df, poll_col='poll', date_col='date', code_col='code', treat_target='X', control_pool=['A', 'B', 'C'], cutoff_date='2020-01-01')

Notes:

  • The default model_config includes:

model_config = {
'time_budget': 90,                     # Total running time in seconds
'metric': 'r2',                        # Primary metric for regression
'estimator_list': ["lgbm"],            # List of ML learners: "lgbm", "rf", "xgboost", "extra_tree", "xgb_limitdepth"
'task': 'regression',                  # Task type
'verbose': verbose                     # Print progress messages
}
  • This configuration can be updated with user-provided model_config.

mlsc_all(df, poll_col, date_col, code_col, control_pool, cutoff_date, training_time=60, n_cores=None)

Performs synthetic control using machine learning regression models in parallel for multiple treatment targets.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the dataset.

  • poll_col (str) – Name of the column containing the poll data.

  • date_col (str) – Name of the column containing the date data.

  • code_col (str) – Name of the column containing the code data.

  • control_pool (list) – List of control pool codes.

  • cutoff_date (str) – Date for splitting pre- and post-treatment datasets.

  • training_time (int, optional) – Total running time in seconds for the AutoML model. Default is 60.

  • n_cores (int, optional) – Number of CPU cores to use. Default is total CPU cores minus one.

Returns:

DataFrame containing synthetic control results for all treatment targets.

Return type:

pandas.DataFrame

Example:

import normet as nm
synthetic_results = nm.mlsc_all(df, poll_col='poll', date_col='date', code_col='code', control_pool=['A', 'B', 'C'], cutoff_date='2020-01-01', training_time=60)