
Dealing with highly redundant json documents. #10

Open
felix-hh opened this issue Nov 3, 2023 · 0 comments

felix-hh commented Nov 3, 2023

Hi! The goal of this library is extremely cool and is very close to a use case I am dealing with right now. Since you must be an expert in this domain, I wanted to ask whether you support my use case, or whether you know of any solution that does :)

If you're curious, this is the schema of the JSON files I am trying to ingest right now. They are newline-delimited .jsonl files.
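For reference, I am reading them record by record along these lines (the file name is made up):

import json

with open("in_network.jsonl") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)    # one top-level object per line
        # ... flatten/ingest each record here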

The key problem: the JSON objects I am ingesting are highly redundant. For example, imagine I am processing root.nested_list[0] and root.nested_list[5], where the items are large objects with further nested lists. If they hold the same value, I would like them to share a single entry in the nested-object table.
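To make "hold the same value" concrete: I imagine keying every nested object by a content hash, so identical values collapse to one row. A minimal sketch of what I mean (md5 over canonical JSON; any stable hash would do):

import hashlib
import json

def content_hash(obj):
    # Canonical serialization: equal values always produce the same id.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()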

Here is a visual sketch of what I mean, where I would like to avoid duplicating repeated service codes. These codes may occur in the same list or in different in_network objects:

"in_network": {
	"procedure": "BLOOD TEST",
	"negotiated_rates": [
		{
			"negotiation_type": "ffs",
			"service_code": [1,2,3] 
		},
		{
			"negotiation_type": "derived",
			"service_code": [4,2,3] 
		},
		{
			"negotiation_type": "ffs",
			"service_code": [1,2,3] 
		}
	]
}

// in_network
[
	{
		"procedure": "BLOOD TEST",
		"negotiated_rates": "R_969c799a3177437d98074d985861242b"
	}
]

// negotiated_rates
[
	{
	    "negotiated_rates__rid_": "R_969c799a3177437d98074d985861242b",
	    "negotiated_rates__index_": 0,
	    "negotiated_rates__hashid_": "9b2d2b5023bd12081a441f15ddfb7725"
	},
	{
	    "negotiated_rates__rid_": "R_969c799a3177437d98074d985861242b",
	    "negotiated_rates__index_": 1,
	    "negotiated_rates__hashid_": "f3b2d2b5023bd12081a441f15ddfb7611"
	},
	{
	    "negotiated_rates__rid_": "R_969c799a3177437d98074d985861242b",
	    "negotiated_rates__index_": 2,
	    "negotiated_rates__hashid_": "9b2d2b5023bd12081a441f15ddfb7725"
	}
]

// negotiated_rate
// note: not creating a separate table for the service_code list, for readability.
[
	{
		"negotiated_rate__hashid_": "9b2d2b5023bd12081a441f15ddfb7725",
		"negotiated_rate_service_code": [1,2,3],
		"negotiated_rate_negotiation_type": "ffs"
	},
	{
		"negotiated_rate__hashid_": "f3b2d2b5023bd12081a441f15ddfb7611",
		"negotiated_rate_service_code": [4,2,3],
		"negotiated_rate_negotiation_type": "derived"
	}
]

Note how the first and third negotiated_rates entries, which hold the same value, share a single row in the negotiated_rate table.
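To spell out the ingestion I have in mind, here is a rough, self-contained sketch. The table and column names just mirror the example above (they are not a proposal for your API), and content_hash is the helper from earlier:

import hashlib
import json
import uuid

def content_hash(obj):
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()

def ingest(in_network):
    rid = "R_" + uuid.uuid4().hex  # id for this particular negotiated_rates list
    mapping_rows = []              # -> negotiated_rates table (rid, index, hashid)
    entity_rows = {}               # -> negotiated_rate table, keyed by content hash
    for index, rate in enumerate(in_network["negotiated_rates"]):
        h = content_hash(rate)
        mapping_rows.append({
            "negotiated_rates__rid_": rid,
            "negotiated_rates__index_": index,
            "negotiated_rates__hashid_": h,
        })
        # Duplicates collapse here: only the first occurrence creates an entity row.
        entity_rows.setdefault(h, {
            "negotiated_rate__hashid_": h,
            **{f"negotiated_rate_{k}": v for k, v in rate.items()},
        })
    parent_row = {"procedure": in_network["procedure"], "negotiated_rates": rid}
    return parent_row, mapping_rows, list(entity_rows.values())

With the example input above, this yields three mapping rows but only two entity rows, since the two "ffs"/[1,2,3] entries hash to the same id.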

I believe your existing solution does not tackle this; is that correct? Do you happen to know of any available tool that can handle this kind of deduplication at ingestion time?
