Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

word counts #1

Closed
5 tasks done
mmaz opened this issue Jun 26, 2020 · 2 comments
Closed
5 tasks done

word counts #1

mmaz opened this issue Jun 26, 2020 · 2 comments
Assignees

Comments

@mmaz
Copy link
Collaborator

mmaz commented Jun 26, 2020

Really great job on kicking off the wordcount feature Tejas! Excited to see you making progress so fast. Some suggestions on next steps:

  • It looks like the current script produces a csv of wordcounts for an input list of keywords. I think what we're looking for is rather, a csv of wordcounts for all words present in the .tsv file (after they have been normalized with clean_and_filter). Let me know if you have questions about this
  • Excellent to see type annotations! Can you also add docstrings please?
  • use standard __main__ (link)
    • I think you can omit sys.argv since you're using argparse
  • format with black (https://github.com/psf/black)
  • rename the file to snakecasing (I have a bad habit of camelcasing .ipynb files but I think python files should be lowercased; eventually we will move several of these functions into a library)

Again, great job!! Let me know if you have any questions or if these suggestions don't make sense.

@tejasprabhune
Copy link
Contributor

I made the changes! Let me know what you think.

@mmaz
Copy link
Collaborator Author

mmaz commented Jun 27, 2020

Nice! One thing to consider is whether it would be better to run clean_and_filter on each input sentence, though (with some refactoring). In other words, will keyword_set have separate entries for lowercase and uppercase words, for example?

It might be helpful to also create a unit test to test some corner cases for this script, and also to document some shortcomings that we aren't currently addressing (such as not combining word stems, which is fine for now). For example:

input = """#sentence
Three apples, three oranges 3 pears & one more pear"""

Your unit test might verify we have

{ "three" : 2,
  "pears": 1,
  "pear": 1,
  "and": 0,
  ...}

@mmaz mmaz closed this as completed Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants