Skip to content

Commit

Permalink
Create README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
Tagar committed Nov 3, 2017
1 parent 2e83d15 commit c6c3cb2
Showing 1 changed file with 94 additions and 0 deletions.
94 changes: 94 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# abalon
Various utility functions for Hadoop, Spark etc

<h1 id="abalon.spark.sparkutils">abalon.spark.sparkutils</h1>

<h2 id="abalon.spark.sparkutils.sparkutils_init">sparkutils_init</h2>

```python
sparkutils_init(i_spark, i_debug=False)
```

Initialize module-level variables

:param i_spark: an object of pyspark.sql.session.SparkSession
:param i_debug: debug output of the below functions?


<h2 id="abalon.spark.sparkutils.file_to_df">file_to_df</h2>

```python
file_to_df(df_name, file_path, header=True, delimiter='|', inferSchema=True, cache=False)
```

Reads in a delimited file and sets up a Spark dataframe

:param df_name: registers this dataframe as a tempTable/view for SQL access;
important: it also registers a global variable under that name
:param file_path: path to a file; local files have to have 'file://' prefix; for HDFS files prefix is 'hdfs://' (optional, as hdfs is default)
:param header: boolean - file has a header record?
:param inferSchema: boolean - infer data types from data? (requires one extra pass over data)
:param delimiter: one character
:param cache: cache this dataframe?


<h2 id="abalon.spark.sparkutils.HDFSwriteString">HDFSwriteString</h2>

```python
HDFSwriteString(dst_file, content, overwrite=True)
```

Creates an HDFS file with given content.
Notice this is usable only for small (metadata like) files.

:param dst_file: destination HDFS file to write to
:param content: string to be written to the file
:param overwrite: overwrite target file?

<h2 id="abalon.spark.sparkutils.dataframeToHDFSfile">dataframeToHDFSfile</h2>

```python
dataframeToHDFSfile(dataframe, dst_file, overwrite=False, header='true', delimiter=',', quoteMode='MINIMAL')
```

dataframeToHDFSfile() saves a dataframe as a delimited file.
It is faster than using dataframe.coalesce(1).write.option('header', 'true').csv(dst_file)
as it doesn't require dataframe to be repartitioned/coalesced before writing.
dataframeToTextFile() uses copyMerge() with HDFS API to merge files.

:param dataframe: source dataframe
:param dst_file: destination file to merge file to
:param overwrite: overwrite destination file if already exists?
:param header: produce header record? Note: the the header record isn't written by Spark,
but by this function instead to workaround having header records in each part file.
:param delimiter: delimiter character
:param quoteMode: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html

<h2 id="abalon.spark.sparkutils.sql_to_df">sql_to_df</h2>

```python
sql_to_df(df_name, sql, cache=False)
```

Runs an sql query and sets up a Spark dataframe

:param df_name: registers this dataframe as a tempTable/view for SQL access;
important: it also registers a global variable under that name
:param sql: Spark SQL query to runs
:param cache: cache this dataframe?

<h2 id="abalon.spark.sparkutils.HDFScopyMerge">HDFScopyMerge</h2>

```python
HDFScopyMerge(src_dir, dst_file, overwrite=False, deleteSource=False)
```

copyMerge() merges files from an HDFS directory to an HDFS files.
File names are sorted in alphabetical order for merge order.
Inspired by https://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/fs/FileUtil.html`line.382`

:param src_dir: source directoy to get files from
:param dst_file: destination file to merge file to
:param overwrite: overwrite destination file if already exists?
:param deleteSource: drop source directory after merge is complete

0 comments on commit c6c3cb2

Please sign in to comment.