Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8997][MLlib]Performance improvements in LocalPrefixSpan #7360

Closed

Conversation

feynmanliang
Copy link
Contributor

Improves the performance of LocalPrefixSpan by implementing optimizations proposed in SPARK-8997

@SparkQA
Copy link

SparkQA commented Jul 13, 2015

Test build #37103 has finished for PR 7360 at commit 70b93e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -42,22 +44,20 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable {
def run(
minCount: Long,
maxPatternLength: Int,
prefix: Array[Int],
projectedDatabase: Array[Array[Int]]): Array[(Array[Int], Long)] = {
prefix: ArrayBuffer[Int],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBuilder is better than ArrayBuffer for Int. The latter is not specialized for Int and hence has boxing/unboxing overhead. But here, we may want to consider List to avoid re-allocating buffers. The cost is that we have to inverse the list (maybe not), e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPTree.scala#L72.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@feynmanliang
Copy link
Contributor Author

@mengxr Updated; I wonder if Stream should be used instead of Iterator since it memoizes previously computed values whereas iterator always recomputes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2015

Test build #37153 has finished for PR 7360 at commit 2e00cba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37163 has finished for PR 7360 at commit f055d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val prefixProjectedDatabases = getPatternAndProjectedDatabase(
prefix, frequentPrefixAndCounts.map(_._1), projectedDatabase)
prefix: List[Int],
database: Iterable[Array[Int]]): Iterator[(Array[Int], Long)] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

database should be an Array[Array[Int]]. We need multiple access to it. The return type should be Iterator[(List[Int], Long)].

@mengxr
Copy link
Contributor

mengxr commented Jul 14, 2015

@feynmanliang Please also remove the Experimental tag on LocalPrefixSpan with your next update. We don't need it for private classes.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37285 has finished for PR 7360 at commit 9212256.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jul 15, 2015

@feynmanliang I sent you some updates at feynmanliang#1. Please review and merge it if it looks good to you. Thanks!

@mengxr
Copy link
Contributor

mengxr commented Jul 15, 2015

Btw, I have some ideas about how to improve it. Basically, instead of getting suffixes in projection, we can actually scan from right to left for each sequence and get prefixes. Then in LocalPrefixSpan, we reverse the ordering to get the correctly ordered sequence. Using the list, the result would be stored as a tree. We might get some benefit but let us not try it in this PR:)

cc @zhangjiajin

update LocalPrefixSpan impl
@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37315 has finished for PR 7360 at commit 59db2f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jul 15, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 1bb8acc Jul 15, 2015
@feynmanliang feynmanliang deleted the SPARK-8997-improve-prefixspan branch July 19, 2015 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants