-
Notifications
You must be signed in to change notification settings - Fork 67
/
04-DA-Clickstream.scala
executable file
·1 lines (1 loc) · 155 KB
/
04-DA-Clickstream.scala
1
{"version":"NotebookV1","origId":503877321546281,"name":"04-DA-Clickstream","language":"scala","commands":[{"version":"CommandV1","origId":503877321546283,"guid":"378e08eb-4276-4e6d-b438-8401fafd7594","subtype":"command","commandType":"auto","position":1.0,"command":"%md ![Wikipedia/Spark Logo](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark.png) \n\n#### Analyze Feb 2015 Wikipedia Clickstream\n\n**Objective:**\nStudy the Wikipedia Clickstream\n\n**Time to Complete:**\n30 mins\n\n**Data Source:**\nClickstream (<a href=\"http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82\" target=\"_blank\">1.2 GB</a>)\n\n**Business Questions:**\n\n* Question # 1) How much traffic did Google send to the \"Apache Spark\" article in Feb 2015?\n* Question # 2) How many unique articles did Google send traffic to?\n* Question # 3) What are the top 10 articles requested from Wikipedia?\n* Question # 4) Who sent the most traffic to Wikipedia in Feb 2015?\n* Question # 5) What percentage of the traffic Wikipedia received came from other English Wikipedia pages?\n* Question # 6) What were the top 5 trending articles on Twitter?\n* Question # 7) What are the most requested missing pages?\n* Question # 8) What does the traffic inflow vs outflow look like for the most requested pages?\n* Question # 9) What does the traffic flow pattern look like for the \"San Diego\" article? Create a visualization for this.\n\n**Technical Accomplishments:**\n* Learn how to use the Spark CSV Library to read structured files\n* Use `%sh` to run shell commands\n* Learn about Spark's architecture and JVM sizing\n* Use `jps` to list Java Virtual Machines\n* Repartition a DataFrame\n* Use the following DataFrame operations: `printSchema()`, `select()`, `show()`, `count()`, `groupBy()`, `sum()`, `limit()`, `orderBy()`, `filter()`, `withColumnRenamed()`, `join()`, `withColumn()`\n* Write a User Defined Function (UDF)\n* Join 2 DataFrames\n* Create a Google visualization to understand the clickstream traffic for the \"San Diego\" article\n* Bonus: Explain in DataFrames and SQL\n\nLab idea from <a href=\"https://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/\" target=\"_blank\">Ellery Wulczyn</a>","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"78e684a8-430f-4094-8ff1-a38779ab380f"},{"version":"CommandV1","origId":503877321546284,"guid":"058243bf-2279-4dda-9131-70517c785615","subtype":"command","commandType":"auto","position":2.0,"command":"%md Attach to, and then restart your cluster first to clear out old memory caches and get to a default, standard environment. The restart should take 1 - 2 minutes.\n\n![Restart Cluster](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/restart_cluster.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"bee2d90a-b185-4e1c-a997-587fcf7ade41"},{"version":"CommandV1","origId":503877321546285,"guid":"323bd779-4aeb-4627-89a1-300561cfa127","subtype":"command","commandType":"auto","position":3.0,"command":"%md ####![Wikipedia Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_wikipedia_tiny.png) ![Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_spark_tiny.png) **Introduction: Wikipedia Clickstream**","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9efa0647-b927-4e12-a4a7-23b67c205427"},{"version":"CommandV1","origId":503877321546286,"guid":"f00eb00c-6bbf-49ab-92ce-a0f6144a619e","subtype":"command","commandType":"auto","position":4.0,"command":"%md The file we are exploring in this lab is the February 2015 English Wikipedia Clickstream data.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5dc1c940-9425-4851-ad1e-41d8773c754d"},{"version":"CommandV1","origId":503877321546287,"guid":"58967874-f274-4c52-bc7d-0fef58d7b378","subtype":"command","commandType":"auto","position":5.0,"command":"%md ####![Intro to data](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/intro_to_data.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6e3a469a-2832-4fa6-9787-51f2cc454fcb"},{"version":"CommandV1","origId":503877321546288,"guid":"32ea7eee-4cdb-4827-aa91-f5b4c7599082","subtype":"command","commandType":"auto","position":6.0,"command":"%md How large is the data? Let's use %fs to find out:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f63dd61b-579b-44a5-9f1e-ba00db04fa5f"},{"version":"CommandV1","origId":503877321546289,"guid":"d17c57fd-9e77-4c46-8fa0-a3f4bd79a3fc","subtype":"command","commandType":"auto","position":7.0,"command":"%fs ls /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"2d17ad3c-b113-4919-ace1-7d14427074f9"},{"version":"CommandV1","origId":503877321546290,"guid":"3cc63186-66b5-4cf2-90c8-fe0951fbbd34","subtype":"command","commandType":"auto","position":8.0,"command":"%md A size of 1322171548 bytes means 1.2 GB.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d9b7b932-b433-4f64-b0a7-0f355bc2fad0"},{"version":"CommandV1","origId":503877321546291,"guid":"76ca8d91-07a7-498f-b428-cc49ac5ef343","subtype":"command","commandType":"auto","position":9.0,"command":"%md According to Wikimedia: \n\n>\"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the \"referer\". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015.\"","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6150fea0-7eec-4992-b86e-b231be3b8442"},{"version":"CommandV1","origId":503877321546292,"guid":"701beb4e-4174-4cf1-bea4-9003f47e09ce","subtype":"command","commandType":"auto","position":10.0,"command":"%md Visually, you can imagine a user clicking through different Wikipedia articles to generate the clickstream data:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b4cd60aa-76db-4b6a-96f4-643da060ed73"},{"version":"CommandV1","origId":503877321546293,"guid":"6cd47da9-210b-4997-94cd-7092093617ff","subtype":"command","commandType":"auto","position":11.0,"command":"%md ####![Clickstream articles](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/clickstream_articles.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c2ba18ac-49d4-4e37-952c-4f717909e66a"},{"version":"CommandV1","origId":503877321546294,"guid":"c8e1ecba-9246-460c-bee8-4125a5e721fa","subtype":"command","commandType":"auto","position":12.0,"command":"%md Use the `sqlContext` to read a tab seperated values file (TSV) of the Clickstream data.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8560d4a0-7cd7-488d-b64d-26a422cfc4fd"},{"version":"CommandV1","origId":503877321546295,"guid":"220a70d9-7200-4374-b7e6-8950304d2d3e","subtype":"command","commandType":"auto","position":13.0,"command":"// Notice that the sqlContext in Databricks is actually a HiveContext\nsqlContext","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d8511353-a6b9-4bbc-9d81-a2a14765c7d1"},{"version":"CommandV1","origId":503877321546296,"guid":"0ecc8a06-514f-40a7-bc70-80edf18d3ea1","subtype":"command","commandType":"auto","position":14.0,"command":"%md A `HiveContext` includes additional features like the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. In general, you should always aim to use the `HiveContext` over the more limited `sqlContext`.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"2278af18-85c7-4089-b9f8-0b38b604684c"},{"version":"CommandV1","origId":503877321546297,"guid":"294eddb9-63cb-438d-824d-60e4ad8478a6","subtype":"command","commandType":"auto","position":15.0,"command":"%md Use the <a href=\"https://github.com/databricks/spark-csv\" target=\"_blank\">Spark CSV Library</a> to parse the tab separated file:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"43867cb7-fc00-4fd3-9092-7a36a9253da9"},{"version":"CommandV1","origId":503877321546298,"guid":"6914272a-952d-44bc-99f0-6b5c840a1ad0","subtype":"command","commandType":"auto","position":16.0,"command":"//Create a DataFrame with the anticipated structure\nval clickstreamDF = sqlContext.read.format(\"com.databricks.spark.csv\")\n .option(\"header\", \"true\")\n .option(\"delimiter\", \"\\\\t\")\n .option(\"mode\", \"PERMISSIVE\")\n .option(\"inferSchema\", \"true\")\n .load(\"dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed\")","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477321919385E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1be5363d-6004-4149-8244-5f60d4a65886"},{"version":"CommandV1","origId":503877321546299,"guid":"6d270b53-e937-4cfa-a48b-d838bf226132","subtype":"command","commandType":"auto","position":17.0,"command":"%md Note that it took 1 minute to read the 1.2 GB file from S3. The above cell kicked off 2 Spark jobs, the first job has one task and just infers the schema from the file. The 2nd job uses 20 tasks to read the 1.2 GB file in parallel (each task reads about 64 MB of the file). ","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"24bf3c16-9568-474e-ba02-71c123ff1244"},{"version":"CommandV1","origId":503877321546300,"guid":"50b20f87-d080-4017-833b-718716b336ca","subtype":"command","commandType":"auto","position":18.0,"command":"%md The `display()` function shows the DataFrame:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"cae860e5-da38-4f71-83ca-96c313b7b1d6"},{"version":"CommandV1","origId":503877321546301,"guid":"6fcd4f79-e351-46e5-9494-4c433d1e835f","subtype":"command","commandType":"auto","position":19.0,"command":"display(clickstreamDF)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f629e10f-f68b-45d9-b868-2ba949665b25"},{"version":"CommandV1","origId":503877321546302,"guid":"8cad270e-4e43-4f38-8775-4d2fc553d209","subtype":"command","commandType":"auto","position":20.0,"command":"%md For example, if you go to the \"Louden Up Now\" Wikipedia article, you will see a link to the !!! music album: <a href=\"https://en.wikipedia.org/wiki/Louden_Up_Now\" target=\"_blank\">Louden_Up_Now</a>","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a8d3d613-8dc1-449a-b35e-02cb627e9ce6"},{"version":"CommandV1","origId":503877321546303,"guid":"230cec1b-9f26-4db5-9ac6-89097d612bbd","subtype":"command","commandType":"auto","position":21.0,"command":"%md The DataFrame contains 22 million rows with referer/resource pairs:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"44aec8ca-1d3d-4302-a5f5-334087075f90"},{"version":"CommandV1","origId":503877321546304,"guid":"6d013072-09dd-44bf-8b96-409fe9147bc1","subtype":"command","commandType":"auto","position":22.0,"command":"%md ####![DF pairs](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/df_pairs.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"cb73a233-f713-4f2f-a324-43c96093fe34"},{"version":"CommandV1","origId":503877321546305,"guid":"ecc056ce-9b01-488a-ab3a-a799f6a19487","subtype":"command","commandType":"auto","position":23.0,"command":"%md `printSchema()` prints out the schema, the data types and whether a column can be null:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f50c2783-0d7f-4e15-aebc-00d3a83357be"},{"version":"CommandV1","origId":503877321546306,"guid":"12ac12e8-b093-479a-a916-f846c9769f96","subtype":"command","commandType":"auto","position":24.0,"command":"clickstreamDF.printSchema()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"281422b3-127c-4085-9cf3-926bf4c70080"},{"version":"CommandV1","origId":503877321546307,"guid":"a464e118-ac36-47ac-aeb0-d1699b2c0ce8","subtype":"command","commandType":"auto","position":25.0,"command":"%md Here is what the 6 columns mean:\n\n- `prev_id`: *(note, we will be dropping this)* if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on\n\n- `curr_id`: *(note, we will be dropping this)* the MediaWiki unique page ID of the article the client requested\n\n- `prev_title`: the result of mapping the referer URL to the fixed set of values described above\n\n- `curr_title`: the title of the article the client requested\n\n- `n`: the number of occurrences of the (referer, resource) pair\n\n- `type`\n - \"link\" if the referer and request are both articles and the referer links to the request\n - \"redlink\" if the referer is an article and links to the request, but the request is not in the production enwiki.page table\n - \"other\" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their refer","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8053e992-a1e2-4416-8ef3-6a2cb94b2567"},{"version":"CommandV1","origId":503877321546308,"guid":"7b527b04-bb13-4c29-8623-834cd6fbadc3","subtype":"command","commandType":"auto","position":26.0,"command":"%md The referer (or prev_title) can be one of the following 10 options:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"cb12b961-10fd-452d-bb8a-b3fcb9ddfeef"},{"version":"CommandV1","origId":503877321546309,"guid":"c135ce80-e501-4f09-999f-ddfee2710deb","subtype":"command","commandType":"auto","position":27.0,"command":"%md ####![referer](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/referer.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5092fe54-7c31-4286-9012-386a3ee8332c"},{"version":"CommandV1","origId":503877321546310,"guid":"5ee0e0b2-95ae-42c6-8a72-83f726c61ffd","subtype":"command","commandType":"auto","position":28.0,"command":"%md `other-empty` is typically HTTPS traffic.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d00f0e39-2886-43e8-9585-84b664c5b07d"},{"version":"CommandV1","origId":503877321546311,"guid":"7bfa2668-86dc-436e-bf0f-5378e40727c1","subtype":"command","commandType":"auto","position":29.0,"command":"%md ####![n](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/n.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"179129fe-2bdc-4346-935f-3814ae38cbf7"},{"version":"CommandV1","origId":503877321546312,"guid":"60840b90-e2cf-4dee-9e07-ac4c061f05ec","subtype":"command","commandType":"auto","position":30.0,"command":"%md ####![type](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/type.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1295fc04-7ab7-490f-b8c9-efbdc77e6585"},{"version":"CommandV1","origId":503877321546313,"guid":"01fa44c9-cb45-49a2-aa41-3e93517f1e07","subtype":"command","commandType":"auto","position":31.0,"command":"%md The two id columns (prev_id and curr_id) are not used in this lab, so let's create a new DataFrame without them:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"04d9eb00-bde2-49c5-95ea-93ab082e5a60"},{"version":"CommandV1","origId":503877321546314,"guid":"3cc4d4b5-99c9-4bcd-a6ac-682f60a52e0b","subtype":"command","commandType":"auto","position":32.0,"command":"// Select out just the columns we need for our analysis\nval clickstreamNoIDsDF = clickstreamDF.select($\"prev_title\", $\"curr_title\", $\"n\", $\"type\")","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477321988286E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"76813140-d3cd-4d44-894e-7e1c6ae605dd"},{"version":"CommandV1","origId":503877321546315,"guid":"b020953b-1fcf-4771-9e56-e7a607d5d50c","subtype":"command","commandType":"auto","position":33.0,"command":"clickstreamNoIDsDF.show(5)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fde3c426-10b5-4da3-b227-5c7016296d07"},{"version":"CommandV1","origId":503877321546316,"guid":"e0dfa09d-2175-4b1f-98a0-0b7def1546db","subtype":"command","commandType":"auto","position":34.0,"command":"%md ####![Databricks Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_databricks_tiny.png) ![Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_spark_tiny.png) **Databricks + Spark Architecture**","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"811a6b89-2453-453c-8a95-e51327fd9657"},{"version":"CommandV1","origId":503877321546317,"guid":"a7eb47ed-831e-4cfc-9d51-c8cafee0491e","subtype":"command","commandType":"auto","position":35.0,"command":"%md Before analyzing the Clickstream, let's learn a bit more about the Databricks + Spark architecture. This will help us optimize the DataFrame # of partitions and Spark SQL queries.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"38c7b8d0-65d2-4306-88cc-2be2c2cee9a3"},{"version":"CommandV1","origId":503877321546318,"guid":"ac26512c-ce17-42c6-b120-ba35e30ed341","subtype":"command","commandType":"auto","position":36.0,"command":"%md Behind the scenes this notebook is connected to a Spark local mode cluster running on a r3.2xlarge EC2 machine in Amazon:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a72d4ca7-bc57-4811-9a2e-b8eed8d4a757"},{"version":"CommandV1","origId":503877321546319,"guid":"7b5dd533-d29b-4374-9216-d2e69b4662c9","subtype":"command","commandType":"auto","position":37.0,"command":"%md ####![EC2 host](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/ec2machine_w_10contatiners.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6aaec55e-e5cd-4ff9-be11-f5cd3af87313"},{"version":"CommandV1","origId":503877321546320,"guid":"0f76044a-3ccc-44d0-860e-541b8bf04424","subtype":"command","commandType":"auto","position":38.0,"command":"%md Each EC2 machine is being shared by 9 other Databricks Community Edition users.\n\nThe Amazon r3.2xlarge machine has 8 vCPUs and 60 GB of memory. Learn more about <a href=\"https://aws.amazon.com/ec2/instance-types/\" target=\"_blank\">Amazon EC2 Instance Types</a>.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9922b507-20a3-4bb7-9366-7fe2f317f164"},{"version":"CommandV1","origId":503877321546321,"guid":"051a3457-e09f-40c8-aa19-a4f40933687c","subtype":"command","commandType":"auto","position":39.0,"command":"%md `%sh` is a magic command in Databricks to run shell commands in the container. \n\nCheck how much memory is free on the EC2 machine:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"91749a8e-64b3-4d3c-b465-4b7a6c1f9076"},{"version":"CommandV1","origId":503877321546322,"guid":"39e4494a-1dc7-4368-b92d-f7f26cb13a3b","subtype":"command","commandType":"auto","position":40.0,"command":"%sh free -mh","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5ec22352-4a06-4f38-adeb-39efd0dcd9bd"},{"version":"CommandV1","origId":503877321546323,"guid":"4e54bdb8-61cc-427b-8c78-cc4a02ea7dd0","subtype":"command","commandType":"auto","position":41.0,"command":"%md The `jps` tool lists the Java Virtual Machines (JVMs) in the container:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"31f34049-cd07-4b6e-bdc6-668efce4870a"},{"version":"CommandV1","origId":503877321546324,"guid":"0dc546bf-886a-4c82-9923-ca0cc695b93c","subtype":"command","commandType":"auto","position":42.0,"command":"%sh jps","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"90f125f7-5ee1-407e-a065-8afcb7dad9cb"},{"version":"CommandV1","origId":503877321546325,"guid":"f189f004-5d47-4aa2-bf76-aceb88b3b2d0","subtype":"command","commandType":"auto","position":43.0,"command":"%md `DriverDaemon` above is the actual 3.7 GB Spark JVM. Ignore `jps` in the output above as it's just the `jps` command running. We'll cover `ChauffeurDaemon` soon.\n\n`jps -v` prints verbose details about each JVM:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c005cafc-d91f-4c46-9154-ac1b50fe9534"},{"version":"CommandV1","origId":503877321546326,"guid":"8c4554b5-49ae-4835-81d5-20bef965f667","subtype":"command","commandType":"auto","position":44.0,"command":"%sh jps -v","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"071759bf-56bc-4a81-90c4-f0911a71767f"},{"version":"CommandV1","origId":503877321546327,"guid":"ee5de80d-c803-44d6-9c7c-0e82bb15a364","subtype":"command","commandType":"auto","position":45.0,"command":"%md Above we can see that the DriverDaemon uses 3.7 GB of RAM (-Xms3776m and -Xmx3776m).","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"64a68ec5-7b41-496b-af9b-2541d82fc352"},{"version":"CommandV1","origId":503877321546328,"guid":"185ba39f-a2bd-4c08-bd55-2fdf8ebfbac0","subtype":"command","commandType":"auto","position":46.0,"command":"%md The 1 GB Chauffeur JVM is an internal Databricks component that multiplexes commands from different notebooks to the same `DirverDaemon` JVM (Spark local mode cluster):","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5336dc45-0e6c-4d5c-926e-341816f1f58e"},{"version":"CommandV1","origId":503877321546329,"guid":"315ab098-cbd1-41e0-a45c-b86e23144ab6","subtype":"command","commandType":"auto","position":47.0,"command":"%md ####![Chauffeur](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/chauffeur.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ccaa3154-1e7f-4532-bcdb-29214c2643cd"},{"version":"CommandV1","origId":503877321546330,"guid":"bde03847-7a3d-4a7a-af44-00c484a15397","subtype":"command","commandType":"auto","position":48.0,"command":"%md Chauffeur basically lets you have multiple notebooks attached to the same cluster.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b835bbe7-c7cf-4fd8-973d-3c237b69e7b3"},{"version":"CommandV1","origId":503877321546331,"guid":"87139c56-410d-4aea-9d2d-55963b2d9c44","subtype":"command","commandType":"auto","position":49.0,"command":"%md Click on the Executors tab in the Spark UI to see more details about the `DriverDaemon` JVM:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a83077ea-6a33-4bb9-b3fe-715d9ac1ebb4"},{"version":"CommandV1","origId":503877321546332,"guid":"e8286a9a-e436-4f8d-94e1-be151a845836","subtype":"command","commandType":"auto","position":50.0,"command":"%md ####![Executors UI](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/local_executors_ui.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c4210afc-5ae2-4a42-9fdc-bbd28c71eb41"},{"version":"CommandV1","origId":503877321546333,"guid":"2018a6f9-28ec-43f8-a825-07313fde441a","subtype":"command","commandType":"auto","position":51.0,"command":"%md Note that of the 3.6 GB Driver+Executor JVM, approximately 2.4 GB (in green) is available for memory caching + persistence.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"4066efb1-045f-4c73-abf1-dc0e37068fd8"},{"version":"CommandV1","origId":503877321546334,"guid":"bf7d02e4-eadc-4bb5-875c-8f88c8305b8f","subtype":"command","commandType":"auto","position":52.0,"command":"%md ![Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_spark_tiny.png) **Memory, DataFrame partitions and repartitioning**","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"911f4668-7a87-4165-af0d-117e1dbefce5"},{"version":"CommandV1","origId":503877321546335,"guid":"e4319a28-0f68-4c4e-bfd2-f11dc68f9c03","subtype":"command","commandType":"auto","position":53.0,"command":"%md The 1.2 GB Clickstream file is currently on S3, which means each time you scan through it, your Spark cluster has to read the 1.2 GB of data remotely over the network.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"beb68126-7628-4500-b92c-dba0b48a31c4"},{"version":"CommandV1","origId":503877321546336,"guid":"1d1fe2b1-9e4b-40b2-92c1-cf41c18eed5b","subtype":"command","commandType":"auto","position":54.0,"command":"%md The DataFrame is currently 20 partitions:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9ca5164c-ed6b-4a0f-97fd-8a3ea8ded0a0"},{"version":"CommandV1","origId":503877321546337,"guid":"f4d01aff-cf5a-4f50-833f-c4a7c19cbf3d","subtype":"command","commandType":"auto","position":55.0,"command":"clickstreamNoIDsDF.rdd.partitions.size","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"18a957ed-c746-4eb2-9cb9-8455139f2ed6"},{"version":"CommandV1","origId":503877321546338,"guid":"c23ca0a7-540f-45db-b0f3-b391e131abc4","subtype":"command","commandType":"auto","position":56.0,"command":"%md We know from our previous labs that just as soon as we do an **orderBy(..)** Spark will use the default partition size of 200.\n\nBecause we want a factor 3, in this case 6 partitions, we are again going to configure the shuffle partition size before getting started.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7d8f5acf-e0ae-431c-92fc-9a5e8118f264"},{"version":"CommandV1","origId":503877321546339,"guid":"85f027ae-a013-4779-ba00-e16fbe303d3c","subtype":"command","commandType":"auto","position":57.0,"command":"sqlContext.setConf(\"spark.sql.shuffle.partitions\", \"6\")","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477322010901E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b57c038e-f77a-4fb0-bb8a-463732474129"},{"version":"CommandV1","origId":503877321546340,"guid":"87cc3d7c-c94d-4d17-8f8f-46f609a9ce9c","subtype":"command","commandType":"auto","position":58.0,"command":"%md Next we need to actually repartition the DataFrame into 6 partitions (from the initial 20) and cache it with a friendly name:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"001122b4-e1d2-4d24-8353-54d2c21475aa"},{"version":"CommandV1","origId":503877321546341,"guid":"3a7cfa3d-f412-4cb3-ad44-2a68df28381a","subtype":"command","commandType":"auto","position":59.0,"command":"val clickstreamNoIDs6partDF = clickstreamNoIDsDF.repartition(6) // Repartion to 6, a multiple of 3\nclickstreamNoIDs6partDF.registerTempTable(\"Clickstream\") // Register a \"temp\" table\nsqlContext.cacheTable(\"Clickstream\") // Cache the table\nclickstreamNoIDs6partDF.count // materialize the cache","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477322013109E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f3a8d88f-c7e0-4e67-ad51-d780f51087cf"},{"version":"CommandV1","origId":503877321546342,"guid":"06396f46-c798-45b1-b478-1dc2ca698382","subtype":"command","commandType":"auto","position":60.0,"command":"%md In the Spark UI, go to the Storage tab, then click on the DataFrame name to see details of the Clickstream DataFrame in memory:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9d0bc0de-3c17-45e7-9237-f648d9ee185e"},{"version":"CommandV1","origId":503877321546343,"guid":"5355f06f-1e62-447b-9502-2fd363f8975e","subtype":"command","commandType":"auto","position":61.0,"command":"%md ####![Clickstream in Memory](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/clickstream_in_mem.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9e3eabde-0670-41fd-aa03-b2d2952ea795"},{"version":"CommandV1","origId":503877321546344,"guid":"87bd34a6-e033-4b3e-9136-9c337d1e8ffc","subtype":"command","commandType":"auto","position":62.0,"command":"%md Notice that each partition (in green) is about 120 MB. An ideal partition size in Spark is about 50 MB - 200 MB.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1d1010ab-be16-484b-9f90-373ca111aaa4"},{"version":"CommandV1","origId":503877321546345,"guid":"8913f1b7-2be1-4f97-8882-e0f28dd33f54","subtype":"command","commandType":"auto","position":63.0,"command":"%md ####![Clickstream in Memory](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/clickstream_in_mem_diagram.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"cb6ecb66-c719-4b6e-915c-9d05e2526bb5"},{"version":"CommandV1","origId":503877321546346,"guid":"3b02e96c-0fc0-426b-9b73-9429d10eba95","subtype":"command","commandType":"auto","position":64.0,"command":"%md In order to run full scan operations against the Clickstream DataFrame, we will need to use 6 tasks, one per partition. \n\nSince the cluster has only 3 slots, it will run the first 3 tasks and then run 3 more tasks.\n\nLet's see this in action with the `count()` action:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a1932a5e-b374-40e9-894d-837189ce7418"},{"version":"CommandV1","origId":503877321546347,"guid":"07d828c6-f34f-4e52-bd04-c78217b3289c","subtype":"command","commandType":"auto","position":65.0,"command":"clickstreamNoIDs6partDF.count","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b76831dc-7330-49ec-a3e7-534c4fbfd2d5"},{"version":"CommandV1","origId":503877321546348,"guid":"754dc896-5e15-485e-ae4d-05df53db14f6","subtype":"command","commandType":"auto","position":66.0,"command":"%md Expand the Spark Jobs and Job # above to see that 6 tasks were launched to scan the DataFrame. ","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fbc913df-5002-4984-b3e6-268fe4c1cc8b"},{"version":"CommandV1","origId":503877321546349,"guid":"300bab53-f3a0-425b-956f-4d33831b168e","subtype":"command","commandType":"auto","position":67.0,"command":"%md ####![6 Tasks](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/clickstream/6tasks.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7fe911a2-b50c-4520-99e5-e1224f55093d"},{"version":"CommandV1","origId":503877321546350,"guid":"92414ac9-f85a-4a3c-a55e-10f7d2fcafd1","subtype":"command","commandType":"auto","position":68.0,"command":"%md The first stage was skipped because the DataFrame is already cached in memory and doesn't need to be re-read from disk.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6565189f-4b85-41e6-bbeb-7a92f72ec8fc"},{"version":"CommandV1","origId":503877321546351,"guid":"df3ee1c3-08c5-4a0b-89f3-fce52c60da09","subtype":"command","commandType":"auto","position":69.0,"command":"%md Before continuing, let's rename the DataFrame with a more friendly name (so it's easier to type):","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0eb454ba-e797-4732-abf8-09a250a43e5d"},{"version":"CommandV1","origId":503877321546352,"guid":"28a0dd88-44ba-41b3-a901-aeb5750aee99","subtype":"command","commandType":"auto","position":70.0,"command":"val clickstreamDF2 = clickstreamNoIDs6partDF","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477322286483E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fb8948a4-5aa2-41b6-a7c1-e8dbce78ad17"},{"version":"CommandV1","origId":503877321546353,"guid":"14ac1447-f374-421e-8db6-68f08847efe2","subtype":"command","commandType":"auto","position":71.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-1) How much traffic did Google send to the \"Apache Spark\" article in Feb 2015?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f6882d50-3be1-4eec-88a5-792edea6322a"},{"version":"CommandV1","origId":503877321546354,"guid":"465c18a3-ea74-4e31-b456-01a1474636b8","subtype":"command","commandType":"auto","position":72.0,"command":"%md We will need to filter our DataFrame where the previous title is Google and the current title is Apache Spark:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ad4bc6fa-a7bf-4a17-89b5-ca3b282e7a77"},{"version":"CommandV1","origId":503877321546355,"guid":"8a874b1a-0ccb-4006-947c-2e5e10f73ec5","subtype":"command","commandType":"auto","position":73.0,"command":"clickstreamDF2.filter($\"prev_title\" === \"other-google\" && $\"curr_title\" === \"Apache_Spark\").show()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fcc1d635-56d8-46f8-8072-679f5b4c3021"},{"version":"CommandV1","origId":503877321546356,"guid":"980e58ca-3edf-4deb-bda4-d21672b12e62","subtype":"command","commandType":"auto","position":74.0,"command":"%md 14,361 visits to \"Apache Spark\" came from Google.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"009e7e6a-f5d7-4c42-b03f-26b7f2348d32"},{"version":"CommandV1","origId":503877321546357,"guid":"fe7268a6-9773-46f8-8701-8d618c5a35ce","subtype":"command","commandType":"auto","position":75.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-2)How many unique articles did Google send traffic to?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"442c4490-1768-4bbb-bd02-b7e19be70c08"},{"version":"CommandV1","origId":503877321546358,"guid":"5f835d0e-b760-485a-b088-312ba37162fa","subtype":"command","commandType":"auto","position":76.0,"command":"clickstreamDF2.filter($\"prev_title\" === \"other-google\").count()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"478a0d02-6465-48a1-91d8-052ca66eac6f"},{"version":"CommandV1","origId":503877321546359,"guid":"2735d07d-7686-4948-a11d-d508dc17f112","subtype":"command","commandType":"auto","position":77.0,"command":"%md Google sent traffic to about 2.5 million unique articles out of a total of ~5 million articles.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"143c0c6a-c475-4679-8c8e-dc46a6904c69"},{"version":"CommandV1","origId":503877321546360,"guid":"cbebf20f-2704-4e13-a308-b66f469de0d8","subtype":"command","commandType":"auto","position":78.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-3) What are the top 10 articles requested from Wikipedia?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9a99f35e-53fc-4597-ab7c-5feca11e26b2"},{"version":"CommandV1","origId":503877321546361,"guid":"35d324b5-84f8-4b51-9963-c6de8232cbd2","subtype":"command","commandType":"auto","position":79.0,"command":"%md We start by grouping by the current title and summing the number of occurrances of the current title:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f7a9766c-0080-4280-a990-0911547ce9cd"},{"version":"CommandV1","origId":503877321546362,"guid":"578f41c3-a503-4863-811c-82e85ee6dc70","subtype":"command","commandType":"auto","position":80.0,"command":"display(clickstreamDF2.groupBy(\"curr_title\").sum().limit(10))","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e2f35bac-6341-407e-8403-9672495eb944"},{"version":"CommandV1","origId":503877321546363,"guid":"99cedee0-9182-456b-968b-a5274e784905","subtype":"command","commandType":"auto","position":81.0,"command":"%md To see just the top 10 articles requested, we also need to order by the sum of n column, in descending order.\n\n** Challenge 1:** Can you build upon the code in the cell above to also order by the sum column in descending order, then limit the results to the top ten?\n\nHint, look up the `orderBy()` operation in the <a href=\"https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrame\" target=\"_blank\">DataFrame API docs</a> via a new browser tab.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7a5727e0-97e3-4f09-8ad0-0e16c8249249"},{"version":"CommandV1","origId":503877321546364,"guid":"aa9393a2-6b0c-4a7d-a4f0-01a99ee9a496","subtype":"command","commandType":"auto","position":82.0,"command":"// Type your answer here...","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f886b285-6dc7-424d-bbf6-9143320ff9b8"},{"version":"CommandV1","origId":503877321546365,"guid":"15e91010-0185-46ab-a3d1-c41af8d737b3","subtype":"command","commandType":"auto","position":83.0,"command":"%md The most requested articles tend to be about media that was popular in February 2015, with a few exceptions.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b1c7e615-dd34-4b18-8fdb-f9dafcafe667"},{"version":"CommandV1","origId":503877321546366,"guid":"00f81d9b-9c42-4e72-ad1e-030efbf22071","subtype":"command","commandType":"auto","position":84.0,"command":"%md Spark SQL lets you seemlessly move between DataFrames and SQL. We can run the same query using SQL:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1339a359-07d0-4ce0-a997-5beab32268c3"},{"version":"CommandV1","origId":503877321546367,"guid":"59f0450a-2b7c-4a34-a4df-fc3ca791b573","subtype":"command","commandType":"auto","position":85.0,"command":"//First register the table, so we can call it from SQL\nclickstreamDF2.registerTempTable(\"clickstream\")","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fa8132e2-d6fe-465b-902d-8136478b1314"},{"version":"CommandV1","origId":503877321546368,"guid":"bf7e2ae7-748a-49c9-9705-e16102b6ded2","subtype":"command","commandType":"auto","position":86.0,"command":"%md Do a simple \"Select all\" query from the `clickstream` table to make sure it's working:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"89873145-ef10-4d77-857f-b75eb7125377"},{"version":"CommandV1","origId":503877321546369,"guid":"3575a98e-e015-4f58-9f04-e6c02b5182ef","subtype":"command","commandType":"auto","position":87.0,"command":"%sql SELECT * FROM clickstream LIMIT 6;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ea7544c0-4941-4393-889c-53cf1344f098"},{"version":"CommandV1","origId":503877321546370,"guid":"976bc459-f0a8-453a-b5b1-38b5d2757f9f","subtype":"command","commandType":"auto","position":88.0,"command":"%md Now we can translate our DataFrames query to SQL:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a58b5dce-0bd4-4358-b239-5ff230419472"},{"version":"CommandV1","origId":503877321546371,"guid":"e5019695-1c01-4060-920d-f2fa3f0d290e","subtype":"command","commandType":"auto","position":89.0,"command":"%sql SELECT curr_title, SUM(n) AS top_articles FROM clickstream WHERE curr_title != 'Main_Page' GROUP BY curr_title ORDER BY top_articles DESC LIMIT 10;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"96b4f26a-3625-4683-8e3e-8dca264518e8"},{"version":"CommandV1","origId":503877321546372,"guid":"9366eaea-9316-410e-b0f8-58d57ff5f88c","subtype":"command","commandType":"auto","position":90.0,"command":"%md SQL also has some handy commands like `DESC` (describe) to see the schema + data types for the table:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c33f6e8b-39df-4ba0-9760-3b1c1f7a103a"},{"version":"CommandV1","origId":503877321546373,"guid":"11e5af2d-d068-4d2b-b4c2-cd41e1386bf7","subtype":"command","commandType":"auto","position":91.0,"command":"%sql DESC clickstream;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"df007eb0-53cc-46cc-8946-0086354afdf5"},{"version":"CommandV1","origId":503877321546374,"guid":"0901f1cd-400b-4cba-aa7d-87ef4660b148","subtype":"command","commandType":"auto","position":92.0,"command":"%md You can use the `SHOW FUNCTIONS` command to see what functions are supported by Spark SQL:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"bc5aea13-5a73-4c51-8e81-d5c2f31df2ef"},{"version":"CommandV1","origId":503877321546375,"guid":"0e98a3f1-4340-4eaa-9f6f-7a77f309d9bc","subtype":"command","commandType":"auto","position":93.0,"command":"%sql SHOW FUNCTIONS;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b1191df6-d9ba-446b-8a5a-7668e164365f"},{"version":"CommandV1","origId":503877321546376,"guid":"c9bd0350-90ff-4bf9-b9f0-b2f6ae566ce6","subtype":"command","commandType":"auto","position":94.0,"command":"%md `EXPLAIN` can be used to understand the Physical Plan of the SQL statement:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9bca152d-d1af-4210-b462-bac40c612264"},{"version":"CommandV1","origId":503877321546377,"guid":"f240abcd-c0fa-4dae-8ad8-f5e8f8a0aaad","subtype":"command","commandType":"auto","position":95.0,"command":"%sql EXPLAIN SELECT curr_title, SUM(n) AS top_articles FROM clickstream WHERE curr_title != 'Main_Page' GROUP BY curr_title ORDER BY top_articles DESC LIMIT 10;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9edab893-629d-4b2a-ae6a-1c8f660df60d"},{"version":"CommandV1","origId":503877321546378,"guid":"c2cef7d3-3078-4536-a27a-bf0d0ae1677f","subtype":"command","commandType":"auto","position":96.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-4) Who sent the most traffic to Wikipedia in Feb 2015? So, who were the top referers to Wikipedia?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e0b7d060-4c56-4f3a-b8e8-261d0e0379b0"},{"version":"CommandV1","origId":503877321546379,"guid":"ddc34d70-ad2a-462b-8350-2d38fd6b5b3d","subtype":"command","commandType":"auto","position":97.0,"command":"clickstreamDF2\n .groupBy(\"prev_title\")\n .sum()\n .orderBy($\"sum(n)\".desc)\n .show(10)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477322274377E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"eae5e9bc-f7b3-4516-bed5-11e5bd0f944b"},{"version":"CommandV1","origId":503877321546380,"guid":"2aca5356-c2eb-4b77-a4fc-2834e157d6ea","subtype":"command","commandType":"auto","position":98.0,"command":"%md The top referer by a large margin is Google, which sent about 1.5 billion clicks to Wikipedia. Next comes refererless traffic (usually clients using HTTPS). The third largest sender of traffic to English Wikipedia are Wikipedia pages that are not in the main namespace (ns = 0) of English Wikipedia. Learn about the Wikipedia namespaces here:\n<a href=\"https://en.wikipedia.org/wiki/Wikipedia:Project_namespace\" target=\"_blank\">Wikipedia:Project_namespace</a>\n\nAlso, note that Twitter sends 10x more requests to Wikipedia than Facebook.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"07e7ba7e-e0fa-45ed-aa5c-2622240471bd"},{"version":"CommandV1","origId":503877321546381,"guid":"9e20aaff-c5b1-450c-bd22-255969b2def5","subtype":"command","commandType":"auto","position":99.0,"command":"%md Let's generate a bar graph visualization for this using the display command and changing the Plot Options:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fc4710bd-415a-4e91-a429-b166454dfdb3"},{"version":"CommandV1","origId":503877321546382,"guid":"07049963-382b-4c54-8fa7-81be3a65b8da","subtype":"command","commandType":"auto","position":100.0,"command":"display(clickstreamDF2.groupBy(\"prev_title\").sum().orderBy($\"sum(n)\".desc).limit(10))","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fd6f9f85-fce1-40ce-a341-81d36ee971ed"},{"version":"CommandV1","origId":503877321546383,"guid":"f9acf02e-51ae-4138-9780-38b07282d07c","subtype":"command","commandType":"auto","position":101.0,"command":"%md To generate a bar graph, click the Bar chart icon:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a8c2c219-b783-4b43-905c-edb2fc1ae9e4"},{"version":"CommandV1","origId":503877321546384,"guid":"a8593601-fd93-414f-96a2-a55a08416dab","subtype":"command","commandType":"auto","position":102.0,"command":"%md \n#![Bar Chart](http://i.imgur.com/XerPxW2.jpg)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9ee80843-d8a8-44ac-947e-154a272282d3"},{"version":"CommandV1","origId":503877321546385,"guid":"d1b7dcb1-680d-4417-861e-33660953eca1","subtype":"command","commandType":"auto","position":103.0,"command":"%md Then click Plot Options and rearrange the Keys and Values as soon in the screenshot below:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"2b64d8be-812e-45e0-be90-652624d08c53"},{"version":"CommandV1","origId":503877321546386,"guid":"e22838fb-a604-4f73-8f6d-1c8d32a73e1d","subtype":"command","commandType":"auto","position":104.0,"command":"%md \n#![Plot Options](http://i.imgur.com/dA4nux6.jpg)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d01d0996-5ffe-496b-82c7-1200575eeafb"},{"version":"CommandV1","origId":503877321546387,"guid":"edff86e9-67b3-428f-b042-0db03fefda87","subtype":"command","commandType":"auto","position":105.0,"command":"%md \n#![Customize Plot](http://i.imgur.com/AaF9vSo.jpg)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"3d611829-1035-4d21-a23e-58bd81b4c9dd"},{"version":"CommandV1","origId":503877321546388,"guid":"d190cc0b-cae5-4af6-ab99-bde9fb76cdde","subtype":"command","commandType":"auto","position":106.0,"command":"%md How many total requests were there for English Wikipedia pages in Feb 2015?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1b0a2a49-c906-41a0-b004-631c291c6e0d"},{"version":"CommandV1","origId":503877321546389,"guid":"9f1dd779-7ce9-4455-a0ab-fe0d09a27e0c","subtype":"command","commandType":"auto","position":107.0,"command":"// Import the sql functions package, which includes statistical functions like sum, max, min, avg, etc.\nimport org.apache.spark.sql.functions._","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7cdbdc7d-2740-4aad-9df9-9ea7b2af64e5"},{"version":"CommandV1","origId":503877321546390,"guid":"dcafbcc0-345c-4974-9910-a947ee980dff","subtype":"command","commandType":"auto","position":108.0,"command":"clickstreamDF2.select(sum($\"n\")).show()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0aad8cb6-6438-47a4-9696-297a85e1bdfd"},{"version":"CommandV1","origId":503877321546391,"guid":"a4935d21-aee8-4a1b-bf0e-e6bf9b3a9bff","subtype":"command","commandType":"auto","position":109.0,"command":"%md So 3.2 billion requests were made to English Wikipedia in Feb 2015.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d074be6f-b4ad-40b7-a77c-68a69dd76db6"},{"version":"CommandV1","origId":503877321546392,"guid":"f7513c18-826a-4562-86c0-ed0eb93745e8","subtype":"command","commandType":"auto","position":110.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-5) What percentage of the traffic Wikipedia received came from other English Wikipedia pages? \n\nSo, how many of the clicks came from other English Wikipedia pages?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8a0caea6-2a80-4cb3-8bef-96c97ed77074"},{"version":"CommandV1","origId":503877321546393,"guid":"5748ba47-82dd-4cda-92ca-bfe68d76afab","subtype":"command","commandType":"auto","position":111.0,"command":"%md Perhaps Google sent 1.5 billion clicks, but how many clicks did Wikipedia generate from internal traffic... people clicking from one Article into another Article and so on?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e7e17b27-d33d-433f-aac3-dcfb83018495"},{"version":"CommandV1","origId":503877321546394,"guid":"372f117d-afdc-450f-b57d-6e4edaccbd07","subtype":"command","commandType":"auto","position":112.0,"command":"%md ** Challenge 2:** Can you answer this question using DataFrames? Hint: Filter out all of the rows where the prev_title is google, twitter, facebook, etc and then sum up the n column.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c8be4784-2f12-4465-b13f-0755aeb17c96"},{"version":"CommandV1","origId":503877321546395,"guid":"c1226332-8cd9-42a4-8282-a53e221e9957","subtype":"command","commandType":"auto","position":113.0,"command":"// Type your answer here...","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"90f1a4ed-fe34-4114-a755-df15986c287e"},{"version":"CommandV1","origId":503877321546396,"guid":"6bcf9813-1a13-404f-952f-90c5ca7e8093","subtype":"command","commandType":"auto","position":114.0,"command":"%md Only ~1.1 billion clicks were generated by people browsing through Wikipedia by clicking from one article to another.\n\nGoogle wins!","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1e07bd44-e0c4-44bb-aec6-8271217beca2"},{"version":"CommandV1","origId":503877321546397,"guid":"96ab2147-a65c-4578-9258-77736bbda081","subtype":"command","commandType":"auto","position":115.0,"command":"%md Did you notice in the job above that Spark pipelined all of the filters into the first Stage?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fcd9566f-4807-4b20-8538-0357a49584b4"},{"version":"CommandV1","origId":503877321546398,"guid":"2dea5ca4-8a6f-44a2-a8de-529e78aba13d","subtype":"command","commandType":"auto","position":116.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-6) What were the top 5 trending articles on Twitter?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d57ac22b-b6ef-4aa9-9ff2-70504ba2178b"},{"version":"CommandV1","origId":503877321546399,"guid":"4daae1a2-f391-4ba1-9a93-93d27b5280b8","subtype":"command","commandType":"auto","position":117.0,"command":"%md ** Challenge 3:** Can you answer this question using DataFrames?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d7adeebd-6677-4054-b8c2-22c4e56c753f"},{"version":"CommandV1","origId":503877321546400,"guid":"094aa8cf-0463-4c09-a9c2-7b0ea6afbad8","subtype":"command","commandType":"auto","position":118.0,"command":"// Type your answer here...","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"147c1b5a-0027-4ae7-a960-e61c2d3f36aa"},{"version":"CommandV1","origId":503877321546401,"guid":"21aac535-1561-416d-8166-b2925c87ab24","subtype":"command","commandType":"auto","position":119.0,"command":"%md ** Challenge 4:** Try re-writing the query above using SQL:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"cc525540-76b0-46fc-9b4b-47ca0be67910"},{"version":"CommandV1","origId":503877321546402,"guid":"80c0d53e-9841-4162-b2b5-b59b46ceb1cd","subtype":"command","commandType":"auto","position":120.0,"command":"// Type your answer here... (tip: you must remove this comment)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c5a50a3a-562c-4bb6-849c-a2df5b8994b6"},{"version":"CommandV1","origId":503877321546403,"guid":"59ae04ba-3ef8-42d0-a7c6-e9c4433c8354","subtype":"command","commandType":"auto","position":121.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-7) What are the most requested missing pages? \n\n(These are the articles that someone should create on Wikipedia!)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0f794895-527f-49af-a582-5d12f550c447"},{"version":"CommandV1","origId":503877321546404,"guid":"84173e58-9e53-49e8-a8e5-92816d174b9a","subtype":"command","commandType":"auto","position":122.0,"command":"%md The type column of our table has 3 possible values:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"56db200b-f560-4a7a-b63d-d4634c9e27a9"},{"version":"CommandV1","origId":503877321546405,"guid":"545d0b8b-7dd2-4bf6-8b8b-3abe9e126ac2","subtype":"command","commandType":"auto","position":123.0,"command":"%sql SELECT DISTINCT type FROM clickstream;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"86d752d4-430a-4ac4-9b78-6b1e22bf805a"},{"version":"CommandV1","origId":503877321546406,"guid":"3b247e00-0bbb-4721-99ea-e6126a876297","subtype":"command","commandType":"auto","position":124.0,"command":"%md These are described as:\n - **link** - if the referer and request are both articles and the referer links to the request\n - **redlink** - if the referer is an article and links to the request, but the request is not in the production enwiki.page table\n - **other** - if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their refer","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7cf8bec2-a0c5-411b-8466-eba79c9e18d0"},{"version":"CommandV1","origId":503877321546407,"guid":"4210aefb-7c65-4e24-a1c7-4191a79c9bfa","subtype":"command","commandType":"auto","position":125.0,"command":"%md Redlinks are links to a Wikipedia page that does not exist, either because it has been deleted, or because the author is anticipating the creation of the page. Seeing which redlinks are the most viewed is interesting because it gives some indication about demand for missing content.\n\nLet's find the most popular redlinks:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0cdf6e18-9044-4b4c-ba61-216c6c1c88ee"},{"version":"CommandV1","origId":503877321546408,"guid":"a3c22986-bbe8-4781-8128-28e2c189dc8a","subtype":"command","commandType":"auto","position":126.0,"command":"display(clickstreamDF2.filter(\"type = 'redlink'\").groupBy(\"curr_title\").sum().orderBy($\"sum(n)\".desc).limit(10))","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"3a4fa17a-85fb-43d9-bd4f-37c5bf699b3a"},{"version":"CommandV1","origId":503877321546409,"guid":"b1cf7aa1-9caa-437a-bf38-913767aa28b0","subtype":"command","commandType":"auto","position":127.0,"command":"%md Indeed there doesn't appear to be an article on the Russian actress <a href=\"https://en.wikipedia.org/wiki/Anna_Lezhneva\" target=\"_blank\">Anna Lezhneva</a> on Wikipedia. Maybe you should create it!\n\nNote that if you clicked on the link for Anna Lezhneva in this cell, then you registered another Redlink for her article.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b90c8875-8f96-4b96-8785-4dc2156a8d5c"},{"version":"CommandV1","origId":503877321546410,"guid":"2a3b1c5e-21cb-458f-bc28-5eb2af2fc8dc","subtype":"command","commandType":"auto","position":128.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-8) What does the traffic inflow vs outflow look like for the most requested pages?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9c4b2e4c-a23e-48e5-9b4a-6b30ae2b6218"},{"version":"CommandV1","origId":503877321546411,"guid":"cf69553e-9065-4d70-bda9-8de97aa88f4d","subtype":"command","commandType":"auto","position":129.0,"command":"%md Wikipedia users get to their desired article by either searching for the article in a search engine or navigating from one Wikipedia article to another by following a link. For example, depending on which technique a user used to get to his desired article of **San Diego**, the (`prev_title`, `curr_title`) tuples would look like:\n- (`other-google`, `San Diego`)\n- (`California`, `San Diego`)\n\nLets look at the ratio of incoming to outgoing links for the most requested pages.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"63ace610-e705-4b87-924e-1d32cba7da2d"},{"version":"CommandV1","origId":503877321546412,"guid":"b3b17de8-72e7-4796-8660-25031e7ad8fb","subtype":"command","commandType":"auto","position":130.0,"command":"%md First, find the pageviews per article:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"96c5d8a5-f19f-40c9-beef-2f5b9c4afc8e"},{"version":"CommandV1","origId":503877321546413,"guid":"f27fb00b-4bdc-42b1-8e78-cf54571cd295","subtype":"command","commandType":"auto","position":131.0,"command":"val pageviewsPerArticleDF = clickstreamDF2\n .groupBy(\"curr_title\")\n .sum()\n .withColumnRenamed(\"sum(n)\", \"in_count\")\n .cache()\n\npageviewsPerArticleDF.show(10)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e6d889f1-5443-4dbd-95e3-faaa8f7ae0da"},{"version":"CommandV1","origId":503877321546414,"guid":"49e9f683-27db-42a4-a572-8c88f02047f0","subtype":"command","commandType":"auto","position":132.0,"command":"%md Above we can see that the `.17_Remington` article on Wikipedia in Feb 2015, got 2,143 views.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c06e788c-94dd-47b2-9f67-aa3f57643071"},{"version":"CommandV1","origId":503877321546415,"guid":"35ef534d-14bc-45b7-9951-182e0bf78926","subtype":"command","commandType":"auto","position":133.0,"command":"%md Then, find the link clicks per article:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5b2638df-d2ea-43db-9387-778a7a68fb3d"},{"version":"CommandV1","origId":503877321546416,"guid":"9c42ad35-306e-4a98-b39a-0f74d364896a","subtype":"command","commandType":"auto","position":134.0,"command":"val linkclicksPerArticleDF = clickstreamDF2\n .groupBy(\"prev_title\")\n .sum()\n .withColumnRenamed(\"sum(n)\", \"out_count\")\n .cache()\n\nlinkclicksPerArticleDF.show(10)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"aea84d2b-93a9-474f-b0a3-5a7d40c47aa2"},{"version":"CommandV1","origId":503877321546417,"guid":"53474e87-3590-4cde-aec2-24d7c494c5ff","subtype":"command","commandType":"auto","position":135.0,"command":"%md So, when people went to the `David_Janson` article on Wikipedia in Feb 2015, 340 times they clicked on a link in that article to go to a next article. ","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"dd32ac43-7c0b-4cac-b876-5f860efc883d"},{"version":"CommandV1","origId":503877321546418,"guid":"739acebe-3ee6-49e3-a22c-6ae86be73f8b","subtype":"command","commandType":"auto","position":136.0,"command":"%md Join the two DataFrames we just created to get a wholistic picture:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ae20cb1d-31dc-43cc-9954-badaa15ccea8"},{"version":"CommandV1","origId":503877321546419,"guid":"8ebc6cae-0951-4e09-981f-492ae1ed9d5a","subtype":"command","commandType":"auto","position":137.0,"command":"val in_outDF = pageviewsPerArticleDF.join(linkclicksPerArticleDF, ($\"curr_title\" === $\"prev_title\")).orderBy($\"in_count\".desc)\n\nin_outDF.show(10)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"73d281cf-823e-4fee-9009-65d4828d8e31"},{"version":"CommandV1","origId":503877321546420,"guid":"26cfa0c6-18bd-4ac6-bdd7-6c025466a4f2","subtype":"command","commandType":"auto","position":138.0,"command":"%md The `curr_title` and `prev_title` above are the same, so we can just display one of them in the future. Next, add a new `ratio` column to easily see whether there is more `in_count` or `out_count` for an article:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"2dea9d2b-7ca2-4ead-9712-5b38b0b35ede"},{"version":"CommandV1","origId":503877321546421,"guid":"bbe3ef56-692d-4a12-95d6-8e9ee34383aa","subtype":"command","commandType":"auto","position":139.0,"command":"val in_out_ratioDF = in_outDF.withColumn(\"ratio\", $\"out_count\" / $\"in_count\").cache()\n\nin_out_ratioDF.select($\"curr_title\", $\"in_count\", $\"out_count\", $\"ratio\").show(5)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"289eb4b7-961a-45fd-8d66-b92562cd6746"},{"version":"CommandV1","origId":503877321546422,"guid":"45ae9298-ae97-4354-921e-e09b90a951b7","subtype":"command","commandType":"auto","position":140.0,"command":"%md We can see above that when clients went to the **Alive** article, almost nobody clicked any links in the article to go on to another article.\n\nBut 49% of people who visited the **Fifty Shades of Grey** article clicked on a link in the article and continued to browse Wikipedia.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8a031d30-f0b6-4d63-8304-4e728c2b16e8"},{"version":"CommandV1","origId":503877321546423,"guid":"46a8c782-b66a-4b73-9402-d9a20b6ff8cf","subtype":"command","commandType":"auto","position":141.0,"command":"%md ####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) Q-9) What does the traffic flow pattern look like for the \"Apache_Spark\" article? Create a visualization for this. ","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1b6f9622-b83a-445e-b666-0a27be211ba2"},{"version":"CommandV1","origId":503877321546424,"guid":"9df9a50e-ef11-4a7b-b7a1-d411af7434f9","subtype":"command","commandType":"auto","position":142.0,"command":"in_out_ratioDF.filter(\"curr_title = 'Apache_Spark'\").show()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1a697aee-412e-418f-9a31-6631fe81af3b"},{"version":"CommandV1","origId":503877321546425,"guid":"bdbc4ea7-b457-4dea-84ba-4d3dae2d3576","subtype":"command","commandType":"auto","position":143.0,"command":"%md Hmm, so about 18% of clients who visit the \"Apache_Spark\" page, click on through to another article.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6f0645fa-eae9-4cbf-b94a-c8addda95ba4"},{"version":"CommandV1","origId":503877321546426,"guid":"47a4f214-df1f-4c33-b529-e6171e2c9f09","subtype":"command","commandType":"auto","position":144.0,"command":"%md Which referrers send the most traffic to the Spark article?","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1ebb2d04-9618-48bf-9c9d-d00c2c42a1cd"},{"version":"CommandV1","origId":503877321546427,"guid":"3f31bfba-673e-4d2d-9e33-969f6bb46c35","subtype":"command","commandType":"auto","position":145.0,"command":"%sql SELECT * FROM clickstream WHERE curr_title LIKE 'Apache_Spark' ORDER BY n DESC LIMIT 10;","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9093c2c4-7839-4a19-ab46-0dc1885897a8"},{"version":"CommandV1","origId":503877321546428,"guid":"84e02ac0-4f47-468c-850b-0ca1775a420f","subtype":"command","commandType":"auto","position":146.0,"command":"%md Here's the same query using DataFrames and `show()`:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"47bfc19d-e995-4ebd-8f24-3a5cce539928"},{"version":"CommandV1","origId":503877321546429,"guid":"ba2b63b0-d469-4d44-a50a-fa2efe482f23","subtype":"command","commandType":"auto","position":147.0,"command":"clickstreamDF2.filter($\"curr_title\".rlike(\"\"\"^Apache_Spark$\"\"\")).orderBy($\"n\".desc).show(10)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9ef11f8f-77fa-4f90-8986-d92d6760c727"},{"version":"CommandV1","origId":503877321546430,"guid":"8c20a5d2-71f2-4d62-9668-d55a51968963","subtype":"command","commandType":"auto","position":148.0,"command":"%md ** Challenge 5:** Which future articles does the \"Apache_Spark\" article send most traffic onward to? Try writing this query using the DataFrames API:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"324fe300-5e30-4ad6-a90c-8cb71637b461"},{"version":"CommandV1","origId":503877321546431,"guid":"ce081202-8bb1-44d4-b05c-7c934b520837","subtype":"command","commandType":"auto","position":149.0,"command":"// Type your answer here...","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fffa71b0-86b8-470b-b2a1-53b2d9438c80"},{"version":"CommandV1","origId":503877321546432,"guid":"cd14fd01-c7bb-433c-97ae-e9187838ee33","subtype":"command","commandType":"auto","position":150.0,"command":"%md Above we can see the topics most people are interested in, when they get to the \"Apache_Spark\" article.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"2614cf15-9ddc-416e-9202-87308c8a9487"},{"version":"CommandV1","origId":503877321546433,"guid":"f3109eb0-00ff-4bc5-b34a-9a31dcb22606","subtype":"command","commandType":"auto","position":151.0,"command":"%md Finally, we'll use a Google Visualization library to create a Sankey diagram. Sankey diagrams are a flow diagram, in which the width of the arrows are shown proportionally to the flow quantify traffic:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a6cef02a-aaa4-40fb-8ac6-ef28df20dd58"},{"version":"CommandV1","origId":503877321546434,"guid":"d74f4cd0-c944-4b50-b318-b8aacc751ac8","subtype":"command","commandType":"auto","position":152.0,"command":"// Note you may need to disable your privacy browser extensions to make this work (especially Privacy Badger)\n\nvar content = \"\"\"\n<!DOCTYPE html>\n<body>\n<script type=\"text/javascript\"\n src=\"https://www.google.com/jsapi?autoload={'modules':[{'name':'visualization','version':'1.1','packages':['sankey']}]}\">\n</script>\n\n<div id=\"sankey_multiple\" style=\"width: 800px; height: 640px;\"></div>\n\n<script type=\"text/javascript\">\ngoogle.setOnLoadCallback(drawChart);\n function drawChart() {\n var data = new google.visualization.DataTable();\n data.addColumn('string', 'From');\n data.addColumn('string', 'To');\n data.addColumn('number', 'Weight');\n data.addRows([\n\"\"\"\n \nclickstreamDF2\n .filter( ($\"curr_title\".rlike(\"\"\"^Apache_Spark$\"\"\") and $\"prev_title\".rlike(\"\"\"^other-.*$\"\"\")) or ($\"prev_title\".rlike(\"\"\"^Apache_Spark$\"\"\") and $\"n\" > 100) )\n .orderBy($\"n\")\n .collect().foreach(x => { content += s\"\\n['${x.get(0)}', '${x.get(1)}', ${x.get(2)}],\" })\n\ncontent += \"\"\"\n ]);\n // Set chart options\n var options = {\n width: 600,\n sankey: {\n link: { color: { fill: '#grey', fillOpacity: 0.3 } },\n node: { color: { fill: '#a61d4c' },\n label: { color: 'black' } },\n }\n };\n // Instantiate and draw our chart, passing in some options.\n var chart = new google.visualization.Sankey(document.getElementById('sankey_multiple'));\n chart.draw(data, options);\n }\n</script>\n </body>\n</html>\"\"\"\n\ndisplayHTML(content)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":1.477322291835E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"bmc@databricks.com","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6e988659-8189-4daa-85a2-19a6dfe3f072"},{"version":"CommandV1","origId":503877321546435,"guid":"3ba2b1be-7c5e-4072-a522-34638d2b8beb","subtype":"command","commandType":"auto","position":153.0,"command":"%md The chart above shows how people get to a Wikipedia article and what articles they click on next.\n\nThis diagram shows incoming and outgoing traffic to the \"Apache\\_Spark\" article. We can see that most people found the \"Apache\\_Spark\" page through Google search and only a small fraction of the readers went on to another article","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6f95e6b0-daab-4eb3-834b-69dca41280d7"},{"version":"CommandV1","origId":503877321546436,"guid":"6ba2e58d-5278-496c-acaa-e66d0080aa0d","subtype":"command","commandType":"auto","position":154.0,"command":"%md Note that it is also possible to programmatically add in the values in the HTML, so you don't have to hand-code it. But to keep things simple, we've hand coded it above. To learn how to do it programatically, check out this notebook: Databricks Guide / Visualizations / HTML, D3 and SVG.","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9f8def24-06a1-4647-8234-a49065c04e4f"},{"version":"CommandV1","origId":503877321546437,"guid":"b2a5d8d7-9180-453a-bae0-0c18533c83ed","subtype":"command","commandType":"auto","position":155.0,"command":"%md \n### Bonus:\n** Learning about Explain to understand Catalyst internals **","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a470f9cb-bb88-43bc-a892-04d94ddccfdb"},{"version":"CommandV1","origId":503877321546438,"guid":"7983ab48-28f3-45e3-996a-6beac5856c71","subtype":"command","commandType":"auto","position":156.0,"command":"%md ![Catalyst Optimizer](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pageviews/catalyst.png)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"50e32a4b-503f-403f-a813-eccc070dc13e"},{"version":"CommandV1","origId":503877321546439,"guid":"8a3f6301-3e1a-427c-b973-4dc31bb28a81","subtype":"command","commandType":"auto","position":157.0,"command":"%md The `explain()` method can be called on a DataFrame to understand its physical plan:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"4eb114cf-525b-43b9-8b69-0ac0664fbbf7"},{"version":"CommandV1","origId":503877321546440,"guid":"451f31d8-d6e7-4b95-a3f2-b6d0c0b48e79","subtype":"command","commandType":"auto","position":158.0,"command":"in_out_ratioDF.explain()","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6fdd1272-5315-4607-8e2c-647023f6146a"},{"version":"CommandV1","origId":503877321546441,"guid":"f9a2a045-9b7d-4bd2-81b0-c4ae009e7314","subtype":"command","commandType":"auto","position":159.0,"command":"%md You can also pass in `true` to see the logical & physical plans:","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"3d421077-0628-48eb-9aa2-2540362ddc30"},{"version":"CommandV1","origId":503877321546442,"guid":"827d9d72-ad58-4dba-a2e4-7fdbf6a23374","subtype":"command","commandType":"auto","position":160.0,"command":"in_out_ratioDF.explain(true)","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"09f1551b-de2f-4a04-8fbf-48037c35539a"}],"dashboards":[],"guid":"f04ef134-1664-4722-a805-6fc81611b38b","globalVars":{},"iPythonMetadata":null,"inputWidgets":{}}