Extracting parameters used to normalise data in create_dataset #267

Closed
tk3016 opened this issue Jun 12, 2024 · 1 comment

Comments

tk3016 commented Jun 12, 2024

As far as I understand, when one calls the 'create_dataset' method with the normalize_input and normalize_label arguments set to True, the returned dataset contains scaled train_input and train_label. Is there any way to also retrieve the unscaled data, or the parameters used to scale it?

Thanks for your help.

Best wishes,
Tanuj

tk3016 changed the title from "Extracting parameters used to normalise data in create_data" to "Extracting parameters used to normalise data in create_dataset" on Jun 12, 2024

kaneyxx commented Jun 13, 2024

You can easily modify the create_dataset() function in utils.py, or write your own version.
When return_stats=True, you will also get a dictionary containing the mean/std used for normalization.
Here's an example (not heavily optimized):

import numpy as np
import torch

def create_dataset(f,
                   n_var=2, 
                   ranges=[-1,1],
                   train_num=1000, 
                   test_num=1000,
                   normalize_input=False,
                   normalize_label=False,
                   return_stats=False,
                   device='cpu',
                   seed=0):
    '''
    create dataset
    
    Args:
    -----
        f : function
            the symbolic formula used to create the synthetic dataset
        n_var : int
            the number of input variables. Default: 2.
        ranges : list or np.array; shape (2,) or (n_var, 2)
            the range of input variables. Default: [-1,1].
        train_num : int
            the number of training samples. Default: 1000.
        test_num : int
            the number of test samples. Default: 1000.
        normalize_input : bool
            If True, apply normalization to inputs. Default: False.
        normalize_label : bool
            If True, apply normalization to labels. Default: False.
        return_stats : bool
            If True, also return the mean and std of inputs/labels when normalize_input or normalize_label is True. Default: False.
        device : str
            device. Default: 'cpu'.
        seed : int
            random seed. Default: 0.
        
    Returns:
    --------
        dataset : dict
            Train/test inputs/labels are dataset['train_input'], dataset['train_label'],
                        dataset['test_input'], dataset['test_label']
        stats : dict (optional)
            the mean/std used to normalize inputs and labels; returned only if
            return_stats is True and at least one normalize flag is set.
         
    Example
    -------
    >>> f = lambda x: torch.exp(torch.sin(torch.pi*x[:,[0]]) + x[:,[1]]**2)
    >>> dataset = create_dataset(f, n_var=2, train_num=100)
    >>> dataset['train_input'].shape
    torch.Size([100, 2])
    '''

    np.random.seed(seed)
    torch.manual_seed(seed)

    # broadcast a single (low, high) pair to every input variable
    if len(np.array(ranges).shape) == 1:
        ranges = np.array(ranges * n_var).reshape(n_var,2)
    else:
        ranges = np.array(ranges)
        
    train_input = torch.zeros(train_num, n_var)
    test_input = torch.zeros(test_num, n_var)
    # sample each input variable uniformly within its range
    for i in range(n_var):
        train_input[:,i] = torch.rand(train_num,)*(ranges[i,1]-ranges[i,0])+ranges[i,0]
        test_input[:,i] = torch.rand(test_num,)*(ranges[i,1]-ranges[i,0])+ranges[i,0]

    train_label = f(train_input)
    test_label = f(test_input)

    def normalize(data, mean, std):
        return (data - mean) / std
            
    stats = {}
    
    if normalize_input:
        # statistics come from the training inputs and are reused for the test split
        mean_input = torch.mean(train_input, dim=0, keepdim=True)
        std_input = torch.std(train_input, dim=0, keepdim=True)
        train_input = normalize(train_input, mean_input, std_input)
        test_input = normalize(test_input, mean_input, std_input)
        stats['mean_input'] = mean_input
        stats['std_input'] = std_input
        
    if normalize_label:
        mean_label = torch.mean(train_label, dim=0, keepdim=True)
        std_label = torch.std(train_label, dim=0, keepdim=True)
        train_label = normalize(train_label, mean_label, std_label)
        test_label = normalize(test_label, mean_label, std_label)
        stats['mean_label'] = mean_label
        stats['std_label'] = std_label

    dataset = {}
    dataset['train_input'] = train_input.to(device)
    dataset['test_input'] = test_input.to(device)

    dataset['train_label'] = train_label.to(device)
    dataset['test_label'] = test_label.to(device)

    if return_stats and (normalize_input or normalize_label):
        return dataset, stats
    else:
        return dataset
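
For example, you could then recover the scaling parameters and undo the normalization like this (a minimal sketch, assuming the modified function above and its stats key names):

f = lambda x: torch.exp(torch.sin(torch.pi*x[:,[0]]) + x[:,[1]]**2)
dataset, stats = create_dataset(f, n_var=2, normalize_input=True, return_stats=True)

# invert the z-score normalization to recover the unscaled inputs
raw_train_input = dataset['train_input'] * stats['std_input'] + stats['mean_input']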

@tk3016 tk3016 closed this as completed Jun 25, 2024