merge develop

HumanSignal · Dec 7, 2023 · ee9c570
2 parents f9d0609 + 36e11d3
commit ee9c570
Showing 35 changed files with 685 additions and 144 deletions.
diff --git a/deploy/uwsgi.ini b/deploy/uwsgi.ini
@@ -1,6 +1,6 @@
 [uwsgi]
 chdir = /label-studio/label_studio
-http = [::]:8000
+http = :8000
 module = core.wsgi:application
 master = true
 cheaper = true

diff --git a/docs/source/guide/dataset_create.md b/docs/source/guide/dataset_create.md
@@ -1,25 +1,30 @@
 ---
-title: Create a dataset
-short: Create a dataset
-date: 2023-08-16 11:52:38
+title: Create a dataset for Data Discovery - Beta 🧪
+short: Import unstructured data
 tier: enterprise
+type: guide
 order: 0
 order_enterprise: 205
-meta_title: Create a Dataset to use with data discovery in Label Studio Enterprise
-meta_description: How to create a Dataset in Label Studio Enterprise using Google Cloud, Azure, or AWS.
-hide_sidebar: true
+meta_title: Create a dataset to use with Data Discovery in Label Studio Enterprise
+meta_description: How to create a dataset in Label Studio Enterprise using Google Cloud, Azure, or AWS.
+section: "Data Discovery"
+date: 2023-08-16 11:52:38
 ---
 
 !!! note
-    At this time, we only support building datasets from a bucket of unstructured data, meaning that the data must be in individual files rather than a structured format such as CSV or JSON.
-
-!!! note
-    To create a new Dataset, your [user role](manage_users#Roles-in-Label-Studio-Enterprise) must have Owner or Administrator permissions. 
+    * At this time, we only support building datasets from a bucket of unstructured data, meaning that the data must be in individual files rather than a structured format such as CSV or JSON.
+    * To create a new dataset, your [user role](manage_users#Roles-in-Label-Studio-Enterprise) must have Owner or Administrator permissions. 
 
 ## Before you begin
 
 Datasets are retrieved from your cloud storage environment. As such, you will need to provide the appropriate access key to pull data from your cloud environment.
 
+If you are using a firewall, ensure you whitelist the following IP addresses (in addition to the [app.humansignal.com range](saas#IP-Range)):
+
+`34.85.250.235`  
+`35.245.250.139`  
+`35.188.239.181`
+
 ## Datasets using AWS
 
 Requirements:
@@ -187,15 +192,15 @@ This user can be tied to a specific person or a group.
     | Bucket Name | Enter the name of the AWS S3 bucket. |
     | Bucket Prefix | Enter the folder name within the bucket that you would like to use.  For example, `data-set-1` or `data-set-1/subfolder-2`.  |
     | File Name Filter | Use glob format to filter which file types to sync. For example, to sync all JPG files, enter `*jpg`. To sync all JPG and PNG files, enter `**/**+(jpg\|png)`.<br><br>At this time, we support the following file types: .jpg, .jpeg, .png, .txt, .text |
-    | Region Name | By default, the region is `us-east-1`. If your bucket is located in a different region, overwrite the default and enter your region here. Otherwise, keep the default  |
+    | Region Name | By default, the region is `us-east-1`. If your bucket is located in a different region, overwrite the default and enter your region here. Otherwise, keep the default.  |
     | S3 Endpoint | Enter an S3 endpoint if you want to override the URL created by S3 to access your bucket. |
     | Access Key ID | Enter the ID for the access key you created in AWS. Ensure this access key has read permissions for the S3 bucket you are targeting (see [Create an AWS access key](#Create-a-policy-for-the-user) above). |
     | Secret Access Key | Enter the secret portion of the [access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) you created earlier. |
     | Session Token | If you are using a session token as part of your authorization (for example, [MFA](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa.html)), enter it here. |
     | Treat every bucket object as a source file | **Enabled** - Each object in the bucket will be imported as a separate record in the dataset.<br>You should leave this option enabled if you are importing a bucket of unstructured data files such as JPG, PNG, or TXT. <br><br>**Disabled** - Disable this option if you are importing structured data, such as JSON or CSV files.<br><br>**NOTE:** At this time, we only support unstructured data. Structured data support is coming soon.  |
     | Recursive scan | Perform recursive scans over the bucket contents if you have nested folders in your S3 bucket. |
     | Use pre-signed URLs | If your tasks contain `s3://…` links, they must be pre-signed in order to be displayed in the browser. |
-    | Pre-signed URL counter | Adjust the counter for how many minutes the pre-signed URLs are valid. |
+    | Expiration minutes | Adjust the counter for how many minutes the pre-signed URLs are valid. |
 
     </div>
 
@@ -213,8 +218,6 @@ Data sync initializes immediately after creating the dataset. Depending on how m
 
 
 
-
-
 ## Datasets using Google Cloud Storage
 
 Requirements:
@@ -356,3 +359,94 @@ Data sync initializes immediately after creating the dataset. Depending on how m
 
 
 
+## Datasets using Microsoft Azure 
+
+Requirements:
+
+- Your data is saved as blobs in an Azure storage account. We do not currently support Azure Data Lake.
+- You have access to retrieve the [storage account access key](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage). 
+- Your storage container has CORS configured properly. Configuring CORS allows you to view the data in Label Studio. When CORS is not configured, you are only able to view links to the data. 
+
+{% details <b>Configure CORS for the Azure storage account</b> %}
+
+
+Configure CORS at the storage account level. 
+
+1. In the Azure portal, navigate to the page for the storage account. 
+2. From the menu on the left, scroll down to **Settings > Resource sharing (CORS)**. 
+3. Under **Blob service** add the following rule:
+
+   * **Allowed origins:** `*` 
+   * **Allowed methods:** `GET` 
+   * **Allowed headers:** `*` 
+   * **Exposed headers:** `Access-Control-Allow-Origin` 
+   * **Max age:** `3600` 
+
+4. Click **Save**. 
+
+![Screenshot of the Azure portal page for configuring CORS](/images/azure-storage-cors.png)
+
+
+{% enddetails %}
+
+{% details <b>Retrieve the Azure storage access key</b> %}
+
+###### Get the Azure storage account access key
+
+When you create a storage account, Azure automatically generates two keys that will provide access to objects within that storage account. For more information about keys, see [Azure documentation - Manage storage account access keys](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage). 
+
+1. Navigate to the storage account page in the portal. 
+2. From the menu on the left, scroll down to **Security + networking > Access keys**. 
+3. Copy the **key** value for either Key 1 or Key 2. 
+
+![Screenshot of the Azure portal access keys page](/images/azure-access-key.png)
+
+
+{% enddetails %}
+
+### Create a dataset from an Azure blob storage container
+
+1. From Label Studio, navigate to the Datasets page and click **Create Dataset**. 
+
+    ![Create a dataset action](/images/data_discovery/dataset_create.png)
+
+2. Complete the following fields and then click **Next**:
+
+    <div class="noheader rowheader">
+
+    | | |
+    | --- | --- |
+    | Name | Enter a name for the dataset. |
+    | Description | Enter a brief description for the dataset.  |
+    | Source | Select Microsoft Azure. |
+
+    </div>
+
+3. Complete the following fields: 
+
+    <div class="noheader rowheader">
+
+    | | |
+    | --- | --- |
+    | Container Name | Enter the name of a container within the Azure storage account. |
+    | Container Prefix | Enter the folder name within the container that you would like to use.  For example, `data-set-1` or `data-set-1/subfolder-2`.  |
+    | File Name Filter | Use glob format to filter which file types to sync. For example, to sync all JPG files, enter `*jpg`. To sync all JPG and PNG files, enter `**/**+(jpg\|png)`.<br><br>At this time, we support the following file types: .jpg, .jpeg, .png, .txt, .text |
+    | Account Name |  Enter the name of the Azure storage account. |
+    | Account key | Enter the access key for the Azure storage account (see [Retrieve the Azure storage access key](#Get-the-Azure-storage-account-access-key) above). |
+    | Treat every bucket object as a source file | **Enabled** - Each object in the bucket will be imported as a separate record in the dataset.<br>You should leave this option enabled if you are importing a bucket of unstructured data files such as JPG, PNG, or TXT. <br><br>**Disabled** - Disable this option if you are importing structured data, such as JSON or CSV files.<br><br>**NOTE:** At this time, we only support unstructured data. Structured data support is coming soon.  |
+    | Use pre-signed URLs | If your tasks contain `azure-blob://…` links, they must be pre-signed in order to be displayed in the browser. |
+    | Expiration minutes | Adjust the counter for how many minutes the pre-signed URLs are valid. |
+
+    </div>
+
+4. Click **Check Connection** to verify your credentials. If your connection is valid, click **Next**. 
+
+    ![Check Dataset connection](/images/data_discovery/dataset_check_connection_azure.png)
+
+5. Provide a name for your dataset column and select a data type. The data type that you select tells Label Studio how to store your data in a way that is [searchable](dataset_search).
+
+    ![Select dataset column](/images/data_discovery/dataset_column_azure.png)
+
+6. Click **Create Dataset**. 
+
+Data sync initializes immediately after creating the dataset. Depending on how much data you have, syncing might take several minutes to complete.
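
The File Name Filter rows in the AWS and Azure tables above describe glob patterns such as `*jpg`. The following is a minimal sketch of how that kind of pattern selects object keys. It uses Python's standard `fnmatch` module, which is an assumption for illustration and not necessarily the matcher Label Studio uses. The object keys are hypothetical. `fnmatch` does not understand the extended `+(jpg\|png)` form, so the sketch approximates it by testing two simple patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys, used only for illustration.
KEYS = ["cat.jpg", "photos/dog.jpg", "notes.txt", "scan.png"]


def matching_keys(keys, patterns):
    """Return the keys that match at least one shell-style glob pattern."""
    return [k for k in keys if any(fnmatchcase(k, p) for p in patterns)]


# "*jpg" keeps every key that ends in "jpg", including keys in subfolders,
# because fnmatch's "*" also matches "/".
jpg_only = matching_keys(KEYS, ["*jpg"])

# Approximation of "all JPG and PNG files" using two simple patterns.
jpg_and_png = matching_keys(KEYS, ["*jpg", "*png"])

print(jpg_only)
print(jpg_and_png)
```

Testing a filter against a handful of known keys before creating the dataset can help catch a pattern that matches nothing.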
diff --git a/docs/source/guide/dataset_manage.md b/docs/source/guide/dataset_manage.md
@@ -1,13 +1,14 @@
 ---
-title: Manage datasets
+title: Manage datasets for Data Discovery - Beta 🧪
 short: Manage datasets
 tier: enterprise
+type: guide
 order: 0
-order_enterprise: 210
+order_enterprise: 215
 meta_title: Manage a dataset in Label Studio Enterprise
 meta_description: How to manage your datasets in Label Studio Enterprise 
 date: 2023-08-23 12:07:13
-hide_sidebar: true
+section: "Data Discovery"
 ---
 
 
@@ -17,18 +18,42 @@ From the Datasets page, click the overflow menu next to dataset and select **Set
 
 ![Overflow menu next to a dataset](/images/data_discovery/dataset_settings.png)
 
-From here you can do the following:
 
-- Edit the dataset name and description.
-- Edit the storage settings. If you edit the storage settings, you will need to re-enter your cloud credentials. For information about the storage setting fields, see their descriptions in [Create a dataset](dataset_create).
+| Settings page &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Description |
+| ---------------- | --- |
+| **General**             | Edit the dataset name and description. |
+| **Storage** | Review the storage settings. For information about the storage setting fields, see their descriptions in [Create a dataset](dataset_create). |
+| **Members** | Manage dataset members. See [Add or remove members](#Add-or-remove-members).  |
+
+
 
 ## Create project tasks from a dataset 
 
-See [Semantic Search](dataset_search). 
+Select the records you want to annotate and click ***n* Records**. From here you can select a project or you can create a new project. 
+
+The selected records are added to the project as individual tasks. 
+
+![Screenshot of the button to add tasks to project](/images/data_discovery/add_tasks.png)
+
+## Add or remove members
+
+From here you can add and remove members. Only users in the Manager role can be added to or removed from a dataset. Reviewers and Annotators cannot be dataset members. 
+
+By default, all users with the Owner or Administrator role are dataset members and cannot be removed. 
+
+| Permission | Roles&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |
+| ---------------- | --- |
+| **Create a dataset** | Owner <br><br>Administrator |
+| **Delete a dataset** | Owner <br><br>Administrator |
+| **View and update dataset settings** | Owner <br><br>Administrator |
+| **View and search dataset** | Owner <br><br>Administrator <br><br>Manager |
+| **Export records to projects** | Owner <br><br>Administrator <br><br>Manager |
+
+
 
 
 ## Delete a dataset
 
-From the Datasets page, select the overflow menu next to dataset and select **Delete**.  
+From the Datasets page, select the overflow menu next to the dataset and select **Delete**. A confirmation prompt appears. 
 
 Deleting a dataset does not affect any project tasks you created using the dataset.
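
The permissions table added in this file can be read as a simple role lookup. The sketch below is illustrative only: the permission keys and the `is_allowed` helper are hypothetical names, not part of any Label Studio API. It assumes, per the text above, that Managers need to be dataset members while Owners and Administrators are always members.

```python
# Roles allowed per permission, transcribed from the table in dataset_manage.md.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
PERMISSIONS = {
    "create_dataset": {"Owner", "Administrator"},
    "delete_dataset": {"Owner", "Administrator"},
    "update_settings": {"Owner", "Administrator"},
    "view_and_search": {"Owner", "Administrator", "Manager"},
    "export_records": {"Owner", "Administrator", "Manager"},
}


def is_allowed(role, permission, is_member=True):
    """Return True if the role may perform the action on a dataset."""
    # Managers only get access to datasets they have been added to.
    if role == "Manager" and not is_member:
        return False
    return role in PERMISSIONS[permission]


print(is_allowed("Manager", "export_records"))
print(is_allowed("Manager", "create_dataset"))
```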
diff --git a/docs/source/guide/dataset_overview.md b/docs/source/guide/dataset_overview.md
@@ -0,0 +1,69 @@
+---
+title: Data Discovery overview - Beta 🧪
+short: Overview
+tier: enterprise
+type: guide
+order: 0
+order_enterprise: 201
+meta_title: Data Discovery overview and features
+meta_description: An overview of Label Studio's Data Discovery functionality, including features and limitations. 
+section: "Data Discovery"
+date: 2023-11-10 15:23:18
+---
+
+> Streamline your data preparation process using Data Discovery in Label Studio. 
+
+!!! note Beta release
+    This feature is currently in beta. To enable Data Discovery, contact your customer success manager or email [cs@humansignal.com](mailto:cs@humansignal.com). 
+
+In machine learning, the quality and relevance of the data used for training directly affects model performance. However, sifting through extensive unstructured datasets to find relevant items can be cumbersome and time-consuming. 
+
+Label Studio's Data Discovery simplifies this by allowing users to perform targeted, [AI-powered searches](dataset_search) within their data. This is especially useful for projects that need specific data subsets to train specialized models.
+
+For example, imagine a scenario in a retail context where a company wants to develop an AI model to recognize and categorize various products in their inventory. Using Label Studio's Data Discovery functionality, they can quickly gather images of specific product types from their extensive database, significantly reducing the time and effort needed for manual data labeling and sorting. This efficiency not only speeds up the model development process, but also enhances the model's accuracy by ensuring a well-curated training dataset.
+
+This targeted approach to data gathering not only saves valuable time but also contributes to the development of more accurate and reliable machine learning models.
+
+!!! info Tip
+    You can use the label distribution charts on a project's [dashboard](dashboards) to identify areas within the project that are underrepresented. You can then use Data Discovery to identify the appropriate dataset records to add to your project for more uniform coverage.
+
+
+#### Process overview
+
+1. Create a dataset by connecting your cloud environment to Label Studio and importing your data. See [Create datasets](dataset_create). 
+2. Use our AI-powered search to sort and filter the dataset. See [Search and filter datasets](dataset_search). 
+3. Select the data you want to use and add it to a labeling project. See [Manage datasets](dataset_manage). 
+4. Start labeling data! 
+
+## Terminology
+
+| Term | Description |
+| --- | --- |
+| **Dataset** | In general terms, a dataset is a collection of data. <br>When referred to here, it means a collection of data created using the Datasets page in Label Studio. |
+| **Data discovery** | In general terms, data discovery is the process of gathering, refining, and classifying data. A data discovery tool helps teams find relevant data for labeling. This covers a full spectrum of tasks, from finding data to include in your initial ground truth dataset to finding very specific data points to remedy underperforming classes or address edge cases.  |
+| **Natural language search** <br><br>**Semantic search**| These two terms are used interchangeably and, in simple terms, mean using text as the search query.|
+| **Similarity search** | Similarity search is when you select one or more records and then sort the dataset by similarity to your selections. |
+| **Record** | An item in a dataset. Each record can be added to a Label Studio project as a task. |
+
+
+## Features, requirements, and constraints
+
+<div class="noheader rowheader">
+
+| Feature | Support |
+| --- | --- |
+| **Supported file types** | .txt <br><br>.png <br><br>.jpg/.jpeg |
+| **Indexable/searchable data** | Image and text |
+| **Supported storage for import** | Google Cloud Storage <br><br>AWS S3 <br><br>Azure blob storage |
+| **Number of storage sources per dataset** | One |
+| **Maximum number of records per dataset** | 1 million |
+| **Number of datasets per org** | 10 |
+| **Supported search types** | Natural language search <br><br>Similarity search |
+| **Supported filter types** | Similarity score |
+| **Required permissions** | **Owners and Administrators** -- Can create datasets and have full administrative access to any existing datasets <br><br>**Managers** -- Must be invited to a dataset. Once invited, they can view the dataset and export records as project tasks. Managers cannot create new datasets or perform administrative tasks on existing ones. <br><br>**Reviewers and Annotators** -- No access to datasets and cannot be added as dataset members.  |
+| **Enterprise vs. Open Source** | Label Studio Enterprise only |
+
+</div>
+
+
+
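
The firewall note added to `dataset_create.md` lists three IP addresses to allow. A minimal sketch for checking an existing allowlist against those addresses follows. It uses only Python's standard `ipaddress` module. The `missing_from_allowlist` helper and the sample CIDR ranges are hypothetical, and the app.humansignal.com range mentioned in the docs is not included because this diff does not list it.

```python
import ipaddress

# Addresses listed in the "Before you begin" section of dataset_create.md.
DATA_DISCOVERY_IPS = [
    ipaddress.ip_address(ip)
    for ip in ("34.85.250.235", "35.245.250.139", "35.188.239.181")
]


def missing_from_allowlist(allowed_networks):
    """Return the listed addresses that no allowed CIDR range covers."""
    networks = [ipaddress.ip_network(n) for n in allowed_networks]
    return sorted(
        str(ip)
        for ip in DATA_DISCOVERY_IPS
        if not any(ip in net for net in networks)
    )


# Hypothetical firewall rule set that allows only one of the three addresses.
print(missing_from_allowlist(["34.85.250.235/32"]))
```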