R-like DataFrame bindings inside Spark

Introduction
This proposal identifies how to work with DataFrame-like bindings within Spark to perform the following operations: a) adding a new column to a DataFrame, b) adding a new group to an existing DataFrame, c) adding a new type to a DataFrame, d) adding a new aggregate to the current DataFrame, and e) showing raw statistics for a DataFrame.

Operations
1. Adding a new column to an existing DataFrame
API: def addNewColumn(Column columnToAdd)
Example Code:

MahoutContext curContext = new MahoutContext();
MahoutDataFrame mdf = curContext.getDataFrame();
// Describe the new column: a label and a data type
Column newColumnToAdd = new Column();
newColumnToAdd.label = "TrustWorthyColumn";
newColumnToAdd.dataType = Number;
mdf.addNewColumn(newColumnToAdd);
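
To make the intended behaviour of addNewColumn concrete, here is a minimal in-memory sketch. SketchColumn and SketchDataFrame are hypothetical stand-ins for the proposed Column and MahoutDataFrame, not existing Mahout or Spark classes, and the two validation rules (unique labels, matching lengths) are assumptions about what the operation might enforce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed Column: a label plus its values.
class SketchColumn {
    final String label;
    final List<Double> values;

    SketchColumn(String label, List<Double> values) {
        this.label = label;
        this.values = new ArrayList<>(values);
    }
}

// Hypothetical stand-in for MahoutDataFrame, held entirely in memory.
class SketchDataFrame {
    // LinkedHashMap keeps columns in the order they were added.
    private final Map<String, SketchColumn> columns = new LinkedHashMap<>();

    // Assumed rules: labels are unique and every column has the same length.
    void addNewColumn(SketchColumn columnToAdd) {
        if (columns.containsKey(columnToAdd.label)) {
            throw new IllegalArgumentException("Duplicate column label: " + columnToAdd.label);
        }
        if (!columns.isEmpty() && columnToAdd.values.size() != rowCount()) {
            throw new IllegalArgumentException("Column length does not match row count");
        }
        columns.put(columnToAdd.label, columnToAdd);
    }

    // Row count is the length of the first column, or 0 for an empty frame.
    int rowCount() {
        return columns.isEmpty() ? 0 : columns.values().iterator().next().values.size();
    }

    List<String> columnLabels() {
        return new ArrayList<>(columns.keySet());
    }
}

class AddColumnDemo {
    public static void main(String[] args) {
        SketchDataFrame df = new SketchDataFrame();
        df.addNewColumn(new SketchColumn("age", Arrays.asList(31.0, 45.0)));
        df.addNewColumn(new SketchColumn("TrustWorthyColumn", Arrays.asList(1.0, 0.0)));
        System.out.println(df.columnLabels());
    }
}
```

In a distributed setting the length check would be more expensive than it is here, so a real implementation might defer or skip it.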

2. Adding a new group to a DataFrame
API: def addGroup(Group groupToAdd)
Example Code:

MahoutContext curContext = new MahoutContext();
MahoutDataFrame mdf = curContext.getDataFrame();
// firstColumn, secondColumn and thirdColumn are existing Column objects
List<Column> columnsWithinGroup = new ArrayList<Column>();
columnsWithinGroup.add(firstColumn);
columnsWithinGroup.add(secondColumn);
columnsWithinGroup.add(thirdColumn);
Group groupToAdd = new Group();
groupToAdd.setLabel("newGroupWithinDataFrame");
groupToAdd.setColumns(columnsWithinGroup);
mdf.addGroup(groupToAdd);
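
One way to read addGroup is as attaching a labelled set of existing columns to the frame. The sketch below illustrates that reading only. SketchGroup and GroupedFrameSketch are hypothetical names, and the rule that a group may only reference columns already in the frame is an assumption rather than part of the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the proposed Group: a label plus member column labels.
class SketchGroup {
    final String label;
    final List<String> columnLabels;

    SketchGroup(String label, List<String> columnLabels) {
        this.label = label;
        this.columnLabels = new ArrayList<>(columnLabels);
    }
}

// Hypothetical frame that tracks only column labels and the groups defined over them.
class GroupedFrameSketch {
    private final Set<String> columnLabels = new LinkedHashSet<>();
    private final Map<String, SketchGroup> groups = new LinkedHashMap<>();

    GroupedFrameSketch(List<String> columnLabels) {
        this.columnLabels.addAll(columnLabels);
    }

    // Assumed rules: group labels are unique and members must already exist in the frame.
    void addGroup(SketchGroup groupToAdd) {
        if (groups.containsKey(groupToAdd.label)) {
            throw new IllegalArgumentException("Duplicate group label: " + groupToAdd.label);
        }
        for (String columnLabel : groupToAdd.columnLabels) {
            if (!columnLabels.contains(columnLabel)) {
                throw new IllegalArgumentException("Unknown column: " + columnLabel);
            }
        }
        groups.put(groupToAdd.label, groupToAdd);
    }

    // Returns the member columns of a group, or an empty list for an unknown group.
    List<String> columnsInGroup(String groupLabel) {
        SketchGroup group = groups.get(groupLabel);
        return group == null ? new ArrayList<String>() : new ArrayList<>(group.columnLabels);
    }
}

class AddGroupDemo {
    public static void main(String[] args) {
        GroupedFrameSketch frame = new GroupedFrameSketch(Arrays.asList("first", "second", "third"));
        frame.addGroup(new SketchGroup("newGroupWithinDataFrame", Arrays.asList("first", "third")));
        System.out.println(frame.columnsInGroup("newGroupWithinDataFrame"));
    }
}
```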

3. Adding a new type to a DataFrame (similar to adding a column, except that the new type is itself represented as a column)
API: def addType(Type typeToAdd)
Note: this assumes the Type object is fully defined (its type and data are set) before it is passed to the function above.
Example Code:

MahoutContext curContext = new MahoutContext();
MahoutDataFrame mdf = curContext.getDataFrame();
Type typeToAdd = new Type();
typeToAdd.setType(Number);
typeToAdd.setData(dataVector);
mdf.addType(typeToAdd);

4. Adding an aggregate to a DataFrame
An aggregate can be defined as a set of groups or values, all of which can be collapsed into a DataFrame. The function could take an empty or partially filled aggregate and expand the current DataFrame with it.
API: def addAggregate(Aggregate aggregateToAdd)
Example Code:

// Pseudocode: each group is built from a range of vectors and lists
Group group1 = new Group(vector1...vectorN, list1...listN)
Group group2 = new Group(vectorN+1...vectorN+1+P, listN+1...listN+1+P)
Aggregate aggregateToAdd = new Aggregate();
aggregateToAdd.add(group1, group2, childMahoutDataFrame);
MahoutContext curContext = new MahoutContext();
MahoutDataFrame mdf = curContext.getDataFrame();
mdf.addAggregate(aggregateToAdd);

5. Showing raw statistics for a Mahout DataFrame
API: def showStats(MahoutDataFrame curDataFrame): Stats
The function above returns the following statistics for a DataFrame: number of rows and columns; number of different data types and their descriptions; average length of an entity inside the DataFrame; names, labels, and units; number of factor levels; class; storage mode; and number of NAs.
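
As a rough illustration of a few of these statistics (row count, column count, NA count, and columns per data type), here is a small sketch. It assumes a frame can be modelled as a map from column label to a list of values, with null standing in for NA. FrameStatsSketch is a hypothetical stand-in for the proposed Stats result, and taking a column's type from its first non-NA value is a simplification.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the proposed Stats result; covers only a subset of the listed statistics.
class FrameStatsSketch {
    final int rows;
    final int columns;
    final int naCount;
    final Map<String, Integer> columnsPerType;

    private FrameStatsSketch(int rows, int columns, int naCount, Map<String, Integer> columnsPerType) {
        this.rows = rows;
        this.columns = columns;
        this.naCount = naCount;
        this.columnsPerType = columnsPerType;
    }

    // Assumes the frame is a map from column label to values, with null standing in for NA.
    static FrameStatsSketch showStats(Map<String, List<Object>> frame) {
        int rows = 0;
        int naCount = 0;
        Map<String, Integer> columnsPerType = new TreeMap<>();
        for (List<Object> column : frame.values()) {
            rows = Math.max(rows, column.size());
            String type = "unknown"; // stays "unknown" when every value is NA
            for (Object value : column) {
                if (value == null) {
                    naCount++;
                } else if (type.equals("unknown")) {
                    // Simplification: the first non-NA value decides the column's type.
                    type = value.getClass().getSimpleName();
                }
            }
            Integer seen = columnsPerType.get(type);
            columnsPerType.put(type, seen == null ? 1 : seen + 1);
        }
        return new FrameStatsSketch(rows, frame.size(), naCount, columnsPerType);
    }
}

class ShowStatsDemo {
    public static void main(String[] args) {
        Map<String, List<Object>> frame = new LinkedHashMap<>();
        frame.put("age", Arrays.<Object>asList(31.0, null, 45.0));
        frame.put("name", Arrays.<Object>asList("ann", "bob", null));
        FrameStatsSketch stats = FrameStatsSketch.showStats(frame);
        System.out.println(stats.rows + " rows, " + stats.columns + " columns, " + stats.naCount + " NAs");
        System.out.println(stats.columnsPerType);
    }
}
```

The remaining statistics (units, factor levels, storage mode, and so on) would need metadata that this simple map-based model does not carry.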

More to come...

Friday, April 25, 2014
