fix copied doc updates not insert #4729

Merged
merged 3 commits into from
Sep 12, 2024

Conversation

swheaton
Contributor

@swheaton swheaton commented Aug 26, 2024

What changes are proposed in this pull request?

Another weird bug that doesn't really cause issues until upcoming changes are introduced.

If you run code like this (which dataset cloning does in 3 places):

doc_copy = doc.copy()
_id = bson.ObjectId()
doc_copy.id = _id

Then mongoengine sets _created to False, which means it treats the object as an update to an existing document, not a new one.
When you call save(), it emits an upsert call instead of an insert.
OK, not so bad, maybe a little weird but whatever ...
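
To make the mechanism concrete, here is a minimal toy model in plain Python. It is not mongoengine's code: `ToyDocument`, the strings returned by `save()`, and the use of `uuid` in place of `bson.ObjectId` are all assumptions made for illustration. It only mimics the behavior described above, where assigning an ID to an already-constructed document flips `_created` to False so that `save()` takes the upsert path.

```python
import uuid


class ToyDocument:
    """Toy stand-in for a mongoengine document (illustrative only)."""

    def __init__(self, **fields):
        # Bypass the __setattr__ bookkeeping while constructing
        object.__setattr__(self, "_created", True)
        object.__setattr__(self, "id", None)
        object.__setattr__(self, "fields", dict(fields))

    def __setattr__(self, name, value):
        # Mimics the behavior described above: assigning an ID after
        # construction marks the document as already existing
        if name == "id":
            object.__setattr__(self, "_created", False)
        object.__setattr__(self, name, value)

    def save(self):
        # New documents are inserted; "existing" ones are upserted
        return "insert" if self._created else "upsert"


fresh = ToyDocument(name="a")
copied = ToyDocument(name="a")
copied.id = uuid.uuid4().hex  # stand-in for bson.ObjectId()

print(fresh.save())   # insert
print(copied.save())  # upsert
```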

But in #4597 @brimoor proposes an optimization where only changed fields are serialized, rather than the whole document. In combination with the above, this causes all kinds of strange behavior. It just so happens to work in that PR because doc._changed_fields is uninitialized (a code smell in mongoengine, they have a TODO to clean it up...) and so doc._delta() returns the whole doc.

But say we cleared changed fields because we didn't know about this strange requirement.

import bson
import fiftyone.core.odm as foo

run_doc = foo.RunDocument(config={"foo": "bar"})
run_doc.save()
doc_copy = run_doc.copy()
doc_copy.id = bson.ObjectId()

doc_copy._clear_changed_fields()
doc_copy.version = "51.51"
print(doc_copy._get_changed_fields())  # ["version"]
doc_copy.save(upsert=True)
doc_copy.reload()

# Oops: this assertion fails. Our config field is {} because it's not a changed
# field, so the update only wrote the "version" field
assert doc_copy.config == {"foo": "bar"}

doc_copy.delete()
run_doc.delete()
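
The data loss in the example above can be reproduced without a database. The sketch below is a toy model, assuming a plain dict as the "collection" and a hypothetical `upsert_set` helper that imitates an upsert with a `$set` payload. It is not the PyMongo or mongoengine API. Note that in the toy the missing field is simply absent, whereas in the real example the document class fills in the default `{}` on reload.

```python
def upsert_set(collection, _id, set_fields):
    """Toy model of an upsert with a $set payload (hypothetical helper)."""
    # Create the document if no document with this ID exists yet
    doc = collection.setdefault(_id, {"_id": _id})
    # Then apply only the fields named in the payload
    doc.update(set_fields)
    return doc


collection = {}
full_doc = {"config": {"foo": "bar"}, "version": "51.51"}
changed_only = {"version": "51.51"}

# Writing the whole document under a new ID keeps every field
upsert_set(collection, "id_full", full_doc)

# Writing only the changed fields under a new ID silently drops the rest
upsert_set(collection, "id_delta", changed_only)

print("config" in collection["id_full"])   # True
print("config" in collection["id_delta"])  # False
```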

OK, but we don't clear changed fields, so we're fine? Nope, we're hanging by a thread due to another mongoengine weirdness: _get_changed_fields() can still return something when embedded documents are edited.

import bson
import fiftyone as fo

ds = fo.Dataset()

# Pretending to clone the dataset doc
doc_copy = ds._doc.copy()
doc_copy.id = bson.ObjectId()
doc_copy.sample_collection_name = f"samples.{str(doc_copy.id)}"
doc_copy.name = "blah"
doc_copy.slug = "blah"

# Making an embedded document update
doc_copy.sample_fields[0].description = "blah"

# Wha? This assertion fails: changed fields is actually
#  ["sample_fields.0.description"] because simple fields aren't tracked but
#  embedded docs are
assert doc_copy._get_changed_fields() == []

ds.delete()
doc_copy.delete()

In mongoengine code, this appears to be band-aided with the following, which is what @brimoor's optimization is trying to avoid:

        # Handles cases where not loaded from_son but has _id
        doc = self.to_mongo()

How is this patch tested? If it is not, please explain why.

Added test for copy_with_new_id.

Ensured that cloning a dataset uses INSERT rather than UPSERT, by adding a print in Document._save():

import fiftyone as fo

ds = fo.Dataset()
ds.clone("blah")

$$ INSERT SON([('_id', ObjectId('66cc999093e25d6a662346e4')), ('name', 'blah'), ('slug', 'blah'), ('version', '0.24.1'), ('created_at', datetime.datetime(2024, 8, 26, 15, 4, 48, 474872)), ('sample_collection_name', 'samples.66cc999093e25d6a662346e4'), ('persistent', False), ('group_media_types', {}), ('tags', []), ('info', {}), ('app_config', SON([('grid_media_field', 'filepath'), ('media_fallback', False), ('media_fields', ['filepath']), ('modal_media_field', 'filepath'), ('plugins', {})])), ('classes', {}), ('default_classes', []), ('mask_targets', {}), ('default_mask_targets', {}), ('skeletons', {}), ('sample_fields', [SON([('name', 'id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_id'), ('description', None), ('info', None)]), SON([('name', 'filepath'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'filepath'), ('description', None), ('info', None)]), SON([('name', 'tags'), ('ftype', 'fiftyone.core.fields.ListField'), ('embedded_doc_type', None), ('subfield', 'fiftyone.core.fields.StringField'), ('fields', []), ('db_field', 'tags'), ('description', None), ('info', None)]), SON([('name', 'metadata'), ('ftype', 'fiftyone.core.fields.EmbeddedDocumentField'), ('embedded_doc_type', 'fiftyone.core.metadata.Metadata'), ('subfield', None), ('fields', [SON([('name', 'size_bytes'), ('ftype', 'fiftyone.core.fields.IntField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'size_bytes'), ('description', None), ('info', None)]), SON([('name', 'mime_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'mime_type'), ('description', None), ('info', None)])]), ('db_field', 'metadata'), ('description', None), ('info', None)]), SON([('name', '_media_type'), ('ftype', 'fiftyone.core.fields.StringField'), 
('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_media_type'), ('description', None), ('info', None)]), SON([('name', '_rand'), ('ftype', 'fiftyone.core.fields.FloatField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_rand'), ('description', None), ('info', None)]), SON([('name', '_dataset_id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_dataset_id'), ('description', None), ('info', None)])]), ('frame_fields', []), ('saved_views', []), ('workspaces', []), ('annotation_runs', {}), ('brain_methods', {}), ('evaluations', {}), ('runs', {})])

^^UPDATES {'$set': {'last_loaded_at': datetime.datetime(2024, 8, 26, 15, 4, 48, 625081)}}

Previously,

^^UPDATES {'$set': SON([('name', 'blah2'), ('slug', 'blah2'), ('version', '0.24.1'), ('created_at', datetime.datetime(2024, 8, 26, 15, 7, 4, 166390)), ('sample_collection_name', 'samples.66cc9a1861678cf7500272b4'), ('persistent', False), ('app_config', SON([('grid_media_field', 'filepath'), ('media_fallback', False), ('media_fields', ['filepath']), ('modal_media_field', 'filepath'), ('plugins', {})])), ('sample_fields', [SON([('name', 'id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_id'), ('description', None), ('info', None)]), SON([('name', 'filepath'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'filepath'), ('description', None), ('info', None)]), SON([('name', 'tags'), ('ftype', 'fiftyone.core.fields.ListField'), ('embedded_doc_type', None), ('subfield', 'fiftyone.core.fields.StringField'), ('fields', []), ('db_field', 'tags'), ('description', None), ('info', None)]), SON([('name', 'metadata'), ('ftype', 'fiftyone.core.fields.EmbeddedDocumentField'), ('embedded_doc_type', 'fiftyone.core.metadata.Metadata'), ('subfield', None), ('fields', [SON([('name', 'size_bytes'), ('ftype', 'fiftyone.core.fields.IntField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'size_bytes'), ('description', None), ('info', None)]), SON([('name', 'mime_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'mime_type'), ('description', None), ('info', None)])]), ('db_field', 'metadata'), ('description', None), ('info', None)]), SON([('name', '_media_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_media_type'), ('description', None), ('info', None)]), SON([('name', '_rand'), ('ftype', 'fiftyone.core.fields.FloatField'), 
('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_rand'), ('description', None), ('info', None)]), SON([('name', '_dataset_id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_dataset_id'), ('description', None), ('info', None)])])]), '$unset': {'group_media_types': 1, 'tags': 1, 'info': 1, 'classes': 1, 'default_classes': 1, 'mask_targets': 1, 'default_mask_targets': 1, 'skeletons': 1, 'frame_fields': 1, 'saved_views': 1, 'workspaces': 1, 'annotation_runs': 1, 'brain_methods': 1, 'evaluations': 1, 'runs': 1}}

^^UPDATES {'$set': {'last_loaded_at': datetime.datetime(2024, 8, 26, 15, 7, 4, 291793)}}

Summary by CodeRabbit

  • New Features

    • Introduced a method to duplicate documents with a new unique identifier, enhancing document management.
  • Bug Fixes

    • Improved ID generation for copied documents, reducing potential errors and improving maintainability.
  • Tests

    • Added new tests to ensure the functionality of copying documents with new IDs works as intended.

@swheaton swheaton requested a review from brimoor August 26, 2024 15:10
Copy link
Contributor

coderabbitai bot commented Aug 26, 2024

Walkthrough

The changes involve significant modifications to the document handling functionalities in the FiftyOne library. A new method, copy, is introduced to the Document class for creating copies of documents with unique identifiers. This change affects the cloning process in dataset and view management, ensuring that copied documents are distinct and properly marked as newly created. Additionally, a new test class has been added to verify the correct behavior of this functionality.

Changes

| Files | Change Summary |
| --- | --- |
| fiftyone/core/dataset.py | Modified _clone_dataset_or_view, _clone_extras, and _clone_run to use copy(new_id=True). |
| fiftyone/core/odm/document.py | Added copy method to the Document class for creating copies with new IDs. |
| tests/unittests/odm_tests.py | Introduced DocumentTests class with test_doc_copy_with_new_id to validate new functionality. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Document
    participant Dataset

    User->>Dataset: Clone Document
    Dataset->>Document: Call copy(new_id=True)
    Document->>Document: Create new document instance
    Document->>Document: Generate new ObjectId
    Document->>Document: Set _created attribute to True
    Document-->>Dataset: Return new document
    Dataset-->>User: Provide cloned document
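
The steps in the diagram above can be sketched in plain Python. This is a toy model under stated assumptions, not the code merged in this PR: `ToyDocument` is hypothetical, `uuid` stands in for `bson.ObjectId`, and the `__setattr__` hook only imitates the reported behavior where assigning an ID marks a document as already existing.

```python
import copy
import uuid


class ToyDocument:
    """Toy stand-in for an ODM document (illustrative only)."""

    def __init__(self, **fields):
        self.__dict__.update(id=None, _created=True, fields=dict(fields))

    def __setattr__(self, name, value):
        # Assumed behavior: assigning an ID marks the doc as existing
        if name == "id":
            self.__dict__["_created"] = False
        self.__dict__[name] = value

    def copy(self, new_id=False):
        doc_copy = copy.deepcopy(self)
        if new_id:
            doc_copy.id = uuid.uuid4().hex  # stand-in for a new ObjectId
            doc_copy._created = True  # force the insert path on save
        return doc_copy

    def save(self):
        return "insert" if self._created else "upsert"


original = ToyDocument(name="ds")
original.id = uuid.uuid4().hex  # pretend it was already saved

same_id = original.copy()
clone = original.copy(new_id=True)

print(same_id.save())  # upsert
print(clone.save())    # insert
```

Resetting `_created` inside `copy()` keeps the workaround in one place, so callers that need the ID before saving do not have to know about the flag.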

🐇 In the land of code, where documents play,
A new ID hops in, brightening the day.
With each little copy, fresh and anew,
The documents dance, as if they all grew.
So let’s celebrate this change with delight,
For every new clone brings joy to our sight! 🐇✨


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 76cdadf and 581d3f9.

Files selected for processing (3)
  • fiftyone/core/dataset.py (3 hunks)
  • fiftyone/core/odm/document.py (2 hunks)
  • tests/unittests/odm_tests.py (2 hunks)
Additional comments not posted (6)
tests/unittests/odm_tests.py (1)

35-56: LGTM!

The test function is well-structured and covers the necessary test cases for the copy_with_new_id method.

The code changes are approved.

fiftyone/core/odm/document.py (1)

587-599: LGTM!

The method is well-implemented and follows the necessary steps to create a new document with a unique ID.

The code changes are approved.

fiftyone/core/dataset.py (4)

7776-7776: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned dataset document.

The code changes are approved.


7780-7781: LGTM!

The code correctly assigns the newly cloned dataset document to the variable dataset_doc.

The code changes are approved.


8252-8253: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned reference document.

The code changes are approved.


Line range hint 8257-8264: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned run document and handles copying the GridFS files.

The code changes are approved.

@swheaton swheaton mentioned this pull request Aug 26, 2024
7 tasks
@benjaminpkane
Contributor

Nice find. Odd and opaque behavior.

It makes sense to me. Trying to grok things a bit...we still like regular copy() for other use cases? And this only impacts documents (not embedded documents, e.g. labels), correct?

Thinking out loud, what about a doc.copy(new_id=False) signature?

@brimoor
Contributor

brimoor commented Aug 28, 2024

@swheaton +1 to both of @benjaminpkane's thoughts here:

  • Still trying to fully grok the implications of what you've found here. Are the other places where we use copy() okay?
  • I like folding this into a copy(new_id=True) syntax. Or, another option could be clone(), since when you dataset.clone() you're creating an identical but fully-independent copy of the dataset (with a new dataset ID).

But, I'm now wondering if we ever use copy() for the purposes of making edits to a doc that we intend to upsert in-place... 🤔

Okay, here's an interesting bit of code when dealing with embedded docs:

def _copy_labels(labels):
    if labels is None:
        return None
    field = labels._LABEL_LIST_FIELD
    _labels = labels.copy()
    # We need the IDs to stay the same
    for _label, label in zip(_labels[field], labels[field]):
        _label.id = label.id
    return _labels

The use case here is that we actually want a copy of the embedded docs with the same IDs because we intend to do in-memory computations on them but don't want the stuff we do to be tracked and persisted on the label objects in the database. Annnnd, this reminds me that apparently foo.EmbeddedDocument.copy() does not behave the same as foo.Document.copy(), it creates a new ID by default!

d = fo.Detection()
assert d.id == d.copy().id  # False!
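
The "copy, then restore the IDs" pattern from `_copy_labels` can be illustrated without FiftyOne. The sketch below is a toy model: `ToyLabel` and `copy_labels_keep_ids` are hypothetical names, and the only behavior it assumes is the one reported above, namely that copying an embedded document yields a new ID by default.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyLabel:
    """Toy stand-in for an embedded label document (illustrative only)."""

    label: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def copy(self):
        # Assumed behavior: a copy gets a freshly generated ID
        return ToyLabel(label=self.label)


def copy_labels_keep_ids(labels):
    """Copies labels for in-memory work while preserving their IDs."""
    copies = [label.copy() for label in labels]
    for _label, label in zip(copies, labels):
        _label.id = label.id
    return copies


labels = [ToyLabel("cat"), ToyLabel("dog")]
copies = copy_labels_keep_ids(labels)
copies[0].label = "edited"  # in-memory edit on the copy only

print(labels[0].label)                # cat
print(copies[0].id == labels[0].id)   # True
```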

And then there's fiftyone.core.document.Document.copy(), which does a third thing: it explicitly returns a document with id == None:

def copy(self, fields=None, omit_fields=None):
    """Returns a deep copy of the document that has not been added to the
    database.

    Args:
        fields (None): an optional field or iterable of fields to which to
            restrict the copy. This can also be a dict mapping existing
            field names to new field names
        omit_fields (None): an optional field or iterable of fields to
            exclude from the copy

    Returns:
        a :class:`Document`
    """
    raise NotImplementedError("subclass must implement copy()")

@swheaton
Contributor Author

Somehow I missed the emails of these reviews coming in ...

Good ideas, I didn't love copy_with_new_id anyways.

foo.Document does behave the same way: copying resets the ID to None. It's just that in our clone_* implementations we want to know the ID before saving so that we can create the sample collection name, set up references, and so on. So we don't leave the ID as None, which would cause an insert with a new ID generated in the normal way.
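
A minimal sketch of why the ID has to exist before the save, assuming plain dicts in place of dataset documents. `new_object_id` and `prepare_clone` are hypothetical helpers, and `secrets.token_hex(12)` merely produces a 24-character hex string in place of a real `bson.ObjectId`. The `samples.<id>` naming follows the snippets and log output in the PR description.

```python
import secrets


def new_object_id():
    """Hypothetical stand-in for bson.ObjectId(): 24 hex characters."""
    return secrets.token_hex(12)


def prepare_clone(dataset_doc, name):
    """Builds a clone whose dependent fields are derived from its new ID."""
    clone = dict(dataset_doc)  # a shallow copy is enough for this sketch
    clone["_id"] = new_object_id()
    clone["name"] = name
    # The collection name depends on the ID, so the ID must be chosen
    # up front rather than generated by the database at insert time
    clone["sample_collection_name"] = f"samples.{clone['_id']}"
    return clone


source = {"_id": "abc", "name": "orig", "sample_collection_name": "samples.abc"}
clone = prepare_clone(source, "blah")
print(clone["sample_collection_name"] == f"samples.{clone['_id']}")  # True
```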

But yes embedded document is different, not sure why exactly. But I don't see any further bugs due to this.

@swheaton
Contributor Author

changed to copy(new_id=False)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 581d3f9 and 63a5cc4.

Files selected for processing (3)
  • fiftyone/core/dataset.py (3 hunks)
  • fiftyone/core/odm/document.py (2 hunks)
  • tests/unittests/odm_tests.py (2 hunks)
Files skipped from review as they are similar to previous changes (3)
  • fiftyone/core/dataset.py
  • fiftyone/core/odm/document.py
  • tests/unittests/odm_tests.py

Contributor

@brimoor brimoor left a comment

LGTM

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 63a5cc4 and 86aad06.

Files selected for processing (1)
  • fiftyone/core/odm/document.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • fiftyone/core/odm/document.py

Contributor

@benjaminpkane benjaminpkane left a comment

Nice!

@brimoor brimoor merged commit 8811fd9 into develop Sep 12, 2024
13 checks passed
@brimoor brimoor deleted the fix/mongoengine-document-misc-bugs branch September 12, 2024 01:49
3 participants