Be able to create dataset from annotated images only #2466

Merged: 2 commits, Mar 15, 2021
Be able to create dataset from annotated images only
Add the ability to create a dataset/splits using only images that have an associated annotation file, i.e. a .txt file. As we discussed, the absence of a .txt file could mean two things:

* either the image hasn't been labelled by anyone yet,
* or there is no object to detect.

While it's easy to create small datasets, when you have to create datasets with thousands of images (and more coming), it's hard to track where you are, and you don't want to wait until all of them are annotated before starting to train. This means some images would lack .txt files and annotations, resulting in label inconsistency, as you say in #2313. With the annotated_only argument added to the function, people can choose to create datasets/splits only from images that are known to have been labelled.
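To make the idea concrete, here is a minimal sketch of keeping only images that have a label file. It is not the code from this PR: `label_path_for`, `annotated_images` and `IMG_EXTS` are hypothetical names, and the sketch assumes the label sits next to the image with the same stem and a `.txt` extension, which may differ from how the repository maps image paths to label paths.

```python
from pathlib import Path

# Assumed subset of image extensions, for illustration only.
IMG_EXTS = {"jpg", "jpeg", "png", "bmp"}


def label_path_for(img: Path) -> Path:
    """Hypothetical mapping: same folder, same stem, .txt extension."""
    return img.with_suffix(".txt")


def annotated_images(root: Path) -> list:
    """Return image files under root that have a label file on disk."""
    images = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix[1:].lower() in IMG_EXTS
    )
    # An image without a .txt is skipped, whether it is unlabelled
    # or simply contains no object: the two cases look identical on disk.
    return [p for p in images if label_path_for(p).exists()]
```

Note that, under this sketch, background images with no objects need an empty .txt file to be kept.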
kinoute committed Mar 14, 2021
commit 3424a3c4bbf7c4680b1e9848ed6e742cb63d489a
23 changes: 18 additions & 5 deletions utils/datasets.py
@@ -1033,19 +1033,32 @@ def extract_boxes(path='../coco128/'): # from utils.datasets import *; extract_
assert cv2.imwrite(str(f), im[b[1]:b[3], b[0]:b[2]]), f'box failure in {f}'


-def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128')
+def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0), annotated_only=False):  # from utils.datasets import *; autosplit('../coco128')
+
     """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
     # Arguments
-        path:       Path to images directory
-        weights:    Train, val, test weights (list)
+        path:           Path to images directory
+        weights:        Train, val, test weights (list)
+        annotated_only: Only use images with an annotated txt file
     """
+
     path = Path(path)  # images dir
-    files = list(path.rglob('*.*'))
+
+    # make sure we only work with images files
+    files = sum([list(path.rglob(f"*.{img_ext}")) for img_ext in img_formats], [])
     n = len(files)  # number of files
+
     indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
+
     txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
     [(path / x).unlink() for x in txt if (path / x).exists()]  # remove existing
+
+    if annotated_only:
+        print("Only annotated images with a .txt file associated will be used to create the dataset")
+
     for i, img in tqdm(zip(indices, files), total=n):
-        if img.suffix[1:] in img_formats:
+        # in case we want to use only annotated files
+        if not annotated_only or (annotated_only and Path(img2label_paths([str(img)])[0]).exists()):
             with open(path / txt[i], 'a') as f:
                 f.write(str(img) + '\n')  # add image to txt file
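The split assignment in the diff above relies on `random.choices` drawing one split index per image. The following is a minimal, standalone sketch of that step only. `assign_splits` and its `seed` parameter are hypothetical and not part of the PR; the real function writes to autosplit_*.txt files rather than returning a dict.

```python
import random


def assign_splits(items, weights=(0.9, 0.1, 0.0), seed=None):
    """Assign each item to train/val/test with the given probabilities."""
    rng = random.Random(seed)  # seed is an illustrative addition for reproducibility
    names = ("train", "val", "test")
    # One independent draw per item, as in the diff's random.choices call.
    indices = rng.choices([0, 1, 2], weights=weights, k=len(items))
    splits = {name: [] for name in names}
    for i, item in zip(indices, items):
        splits[names[i]].append(item)
    return splits
```

Because each image is drawn independently, the resulting proportions only approximate the weights. In the diff, indices are drawn for every image file before the annotation check, so skipping unannotated images shrinks the splits without changing the probability that a kept image lands in a given split.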