Skip to content

The task on Data Analyst Internship represented by Rambler&Co🧮

License

Notifications You must be signed in to change notification settings

JuiceFV/Data_Analyst_Internship_Task

Repository files navigation

Data Analyst Internship Task

Sup, Hi, Hello, Good (evening/morning/afternoon). I created this repository in purpose to study the LaTeX basis, due to the Microsoft Word's tools are not enough for me, and perhaps it will be helpful in my future education. The task has been presented by AAA Media Corporation Rambler Group. I'd like to use this repository as a representation of me (I mean how do I work) for the corporations and other people. Soooo, let's plunge into the tasks:)

Table Of Contents

First Task

This task checks whether how well you know the SQL, or how well you use Google, or both.

The Condition

Rambler Group's advertising campaign uses most fascinating and most memorable banners. Analytics have access to the databases containing data regard the banner' showing. The Shows_table contains:

  • show_id - an identifier of a showing
  • day - a day of a showig
show_id day
12367 2018-10-04
28736 2019-02-22
19862 2019-01-31

The Click_table contains:

  • click_id - a show identifier clicked by an user
  • bounce - an user dismissing from an advertising after click (0 - when an user relinked to the site he keened in the information on the site. 1 - an user immediatly left the site.)
show_id bounce
12367 1
28735 0
15627 0

You need to get all users who clicked at a banner in February 2020, and they din't reject an advertising.

The Solution

  1. First of all, I'd like to obtain all users, I mean the whole database of users, without any filters. But I banned by the task, because it is clearly said that we need to fetch the "users", however I do not really aware what exactly is the "users". Ensuing of this I decided to derive the show_id (it also can be the click_id, result will be the same).
SELECT show_id FROM Shows_table
  1. The next scanty modification is to retrieve unique show_id. For instance, if an user clicked twice on the same banner, it shouldn't be represented twice, therefore distinct it.
SELECT DISTINCT show_id FROM Shows_table
  1. The next step is to detect those who clicked and remained on the advertising site. The INER JOIN the most appropriating command for this purpose.

The INNER JOIN keyword selects records that have matching values in both tables

SELECT DISTINCT show_id FROM Shows_table
INNER JOIN Clicks_table ON
show_id = click_id AND bounce ='0'

Let's deem this query on the example

show_id day
12367 2018-10-04
15627 2020-02-22
28736 2019-01-31
show_id bounce
12367 1
28736 0
15627 0

In the above case the banners which ids are 15627 and 28735 are those banners that an user remained by clicking on. So, this request returns such list:

SELECT DISTINCT show_id FROM Shows_table
12367
28736
15627

If be honest this one does the same as the previous one

SELECT DISTINCT show_id FROM Shows_table
INNER JOIN Clicks_table ON
show_id = click_id

And the last thing in this subsection is to filter the case when an user remains on the advertising site. According the task "bounce - an user dismissing from an advertising after click (0 - when an user relinked to the site he keened in the information on the site. 1 - an user immediatly left the site.)" This request perfectly does this, and returns such id's list.

28736
15627
  1. The last what we should to do is to distill by date. The task says "You need to get all users who clicked at a banner in February 2020, and they din't reject an advertising.", it does pretty simple in SQL, by adding up this to the request.
... day BETWEEN '2020-02-01' AND '2020-02-29'
  1. The result looks like this:
SELECT DISTINCT show_id FROM Shows_table
INNER JOIN Clicks_table ON
show_id = click_id AND bounce ='0' AND day BETWEEN '2020-02-01' AND '2020-02-29'

In our example this query has to return only the 15627, because this is the only click_id within February 2020.

Testing

In purpose to test my solution I was using the SQLFiddle and PostgreSQL 9.6.

  1. I created the test-tables, using DLL, by these queries:
CREATE TABLE Shows_table (
    show_id INT ,
    day DATE
);
CREATE TABLE Clicks_table (
    click_id INT ,
    bounce BOOLEAN
);
  1. Fill them (tables) up, by using these requests
INSERT INTO Shows_table ( show_id, day) VALUES (12367, '2018-10-04');
INSERT INTO Shows_table ( show_id, day) VALUES (28736, '2019-02-22');
INSERT INTO Shows_table ( show_id, day) VALUES (19862, '2019-01-31');
INSERT INTO Shows_table ( show_id, day) VALUES (11111, '2020-02-04');
INSERT INTO Shows_table ( show_id, day) VALUES (22222, '2020-02-01');
INSERT INTO Shows_table ( show_id, day) VALUES (33333, '2020-02-29');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (12367, '1');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (28736, '0');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (19862, '0');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (11111, '1');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (22222, '0');
INSERT INTO Clicks_table ( click_id, bounce ) VALUES (33333, '0')

And now the tables looks like these:

show_id day
12367 2018-10-04
28736 2019-02-22
19862 2019-01-31
11111 2020-02-04
22222 2020-02-01
33333 2020-02-29
show_id bounce
12367 1
28736 0
19862 0
11111 1
22222 0
33333 0
  1. For this example the query (my solution) returns the 22222 and 33333.
  • 12367 - isn't suitable, because an user left the site and it was in October 2018
  • 28736 - isn't suitable, because it was in February 2019.
  • 19862 - isn't suitable, because it was in January 2019.
  • 11111 - isn't suitable, because an user left the site.

Answer

The answer for the First task

SELECT DISTINCT show_id FROM Shows_table
INNER JOIN Clicks_table ON
show_id = click_id AND bounce ='0' AND day BETWEEN '2020-02-01' AND '2020-02-29'

Second Task

This task checks your ability to use math.

The Condtition

The friendly Rambler Group's community likes to play in the table football: At the odd days they play before lunch, at the even days the play after lunch. They are splitting at the N teams among each other, and every team plays with each another team. Because of the splitting onto the teams is randomly, the product of the games is random. Also I would note that there are no ties. Only win or lose.

  1. Estimate the probability if one of the teams will finish the tournament without defeat.
  2. How many times do you need to hold a tournament, so that with a probability of 98% at least once this happened?

Solution for the first question

What we have:
N - number of teams
P (win) = 1/2
P (lose) = 1/2
Necessary to seek:
P (if one of the teams will finish the tournament without defeat) - ?

There are two ways to solve it, the First one is the combinatorics and the second is the probability theory. I will show the each one.

Combinatorics way

  1. First of all we need to estimate how many games N teams plays. Because of each team plays with every another team, therefore the number of games is the number of combination N by 2. ("by 2" because two teams participate in a game.)

    I'd like to recall that the factorial of a positive integer n, denoted by n!, is the product of all positive integers less than or equal to n. For instance, the factorial of N is 1 × 2 × 3 × ... × (N − 2) × (N − 1) × N.

    image

  2. Each team either wins nor loses, therefore there are two outcomes. Ensuing of this the amount of all possible outcomes in whole tournament is image

  3. Let's assume that the team A is won each game in the tournament. It means that from the (N - 1) games the team A is won (N - 1) games. Moreover, it means that the quantity of uncertain game's outcomes declines on (N - 1). Therefore the overall number of outcomes, with the condition that team A wins all games is: image

  4. However, nobody knows which exactly team wins all games. It's not necessary that team A wins, as same as A it could be either team B, nor team C, nor team D and so on. If we will deem each case (for each team), then the overall amount of outcomes increases by N (number of teams) times. image For N teams it (number of all possible outcomes, where one of the teams wins) equals to: image

  5. Recall that probability definition:

    The probability of an event is a number indicating how likely that event will occur. And the probability of an event is: image where m - the number of demanded outcomes and N - the number of all possible outcomes.

    Therefore to obtain the probability of a team will finish the tournament without defeat we need to divide the number of outcomes where one of the teams is won whole tournament image by number of all possible outcomes image image

  6. Here it is, the answer. It means that the probability of a team wins the whole tournament without lose is image

Probability theory way

If be honest this approach much easier than previous one.

  1. Let's assume that the team A wins each game in the tournament, and the probability of this event is image - the probability of an outcome (whether lose nor victory) of one game. And (N − 1) the number of games which plays the team A. image

  2. However the probability of that the one of the teams wins the tournament is image

  3. So, the probability that one of the teams wins the tournament without defeat is image

Solution of the second part

How many times do you need to hold a tournament, so that with a probability of 98% at least once this happened (one of the teams wins the tournament without defeat)?

  1. Let x be the quantity of tournaments required to befell to approximate the probability of the event (victory without defeat) to 98%.

  2. Then, image

    Let's explore the equation above. In the previous part we've figured out the probability that one of the teams wins the whole tournament without defeat is image or image. Of course the probability of that this event never happens is (100% - probability of happens at least once). Therefore, in our case it looks like image, in other words this equation shows us the probability that there is no a team which wins a tournament without defeat. But this probability represents the only tournament, however we have x tournaments, and in the each one there is shouldn't be a team which wins at least 1 tournament without defeat.image

    And the last thing, when we subtracting from 100% the probability of the event when there is no a team which wins at least one tournament without defeat. This shows us the probability when a team which wins at least one tournament exsists.

  3. And according the task's condition this probability equal to 98% or 0.98

    image

  4. Substitute it under logarithm image

  5. And use log-power rule

    image

  6. Derive the x

    image

Here is it! The number of required tournaments is image Of course it depends on number of teams, for instance for two teams the only tournament enough.

Answer

The answer for the Problem 1

  1. The probability that one of the teams wins the tournament without defeat is image
  2. The number of required to befell tournaments to approximate the probability of the event (victory without defeat) to 98% is image

The Tasks Descriptions (PDF-FILE to download)

The Tasks Solutions (PDF-FILE to download)

About

The task on Data Analyst Internship represented by Rambler&Co🧮

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages