"Memory problem during Pivot operator with large file for market basket analysis"
Legacy User
Hey!
I spent the whole day trying to solve my problem, but to no avail, so I thought I'd ask for some help here. Perhaps someone has an idea.
I want to do a market basket analysis on my data. It is basically a .csv file of this format:
SomePlaceA, 2005-01-01, SomeNameA
SomePlaceA, 2005-01-02, SomeNameB
SomePlaceA, 2005-01-02, SomeNameC
…
SomePlaceB, 2005-01-01, SomeNameB
…
The first column is the place (nominal), the second a date (date), the third a name (nominal). I need place and date together as the primary key, since that combination defines my ‘baskets’. I am interested in the ‘item sets’ for each day at each place.
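Just to make the ‘basket’ idea concrete, the baskets implied by the sample rows above would look like this (a minimal sketch; each key is simply place and date combined):

```python
# Baskets implied by the sample rows above: one basket per (place, date) pair,
# holding the set of names seen at that place on that day.
baskets = {
    ("SomePlaceA", "2005-01-01"): {"SomeNameA"},
    ("SomePlaceA", "2005-01-02"): {"SomeNameB", "SomeNameC"},
    ("SomePlaceB", "2005-01-01"): {"SomeNameB"},
}
```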
I’ve loaded the .csv file into the repository, as that was much faster than loading the .csv again on each run. The data types were set as above and the roles all as regular. Some attribute names were given.
There are about 3.3 million rows in the .csv, which would be about 2 million baskets. A smaller but similar data set has about 1.9 million rows and about 700K baskets.
I use the following process:
1) Retrieve (previously imported data from Repo)
1a) (Sample, for testing with smaller data portions)
2) Generate Concatenation (first and second attribute as a ‘primary key’)
3) Rename (attribute from 2) to Basket)
4) Select Attributes (Basket and Name)
5) Set Role (Basket to ID, although I am not sure if it is necessary)
6) Generate Attribute (Amount=1, this is needed for later ‘Pivoting’)
7) Pivot (group attribute: Basket, index attribute: name)
8) Replace Missing Values (Pivoting adds an attribute for each name. I replace the values with zero, where the given name was not in the current set.)
9) Numerical to Binomial (as FP-Growth demands it)
10) FP-Growth
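(For orientation only, here is a rough sketch of what steps 2–9 amount to, written in pandas terms rather than as the actual RapidMiner process; the column names place, date and name are assumptions based on the description above.)

```python
import pandas as pd

# Hypothetical column names; the real repository entry may use different ones.
df = pd.read_csv("transactions.csv", names=["place", "date", "name"])

# Steps 2-3: concatenate place and date into a single 'Basket' key.
df["Basket"] = df["place"].astype(str) + "_" + df["date"].astype(str)

# Steps 4-7: keep Basket and name, generate Amount=1 and pivot to a
# basket x item matrix (one column per distinct name).
df["Amount"] = 1
pivot = df.pivot_table(index="Basket", columns="name",
                       values="Amount", aggfunc="max")

# Steps 8-9: replace missings with 0 and convert to boolean (binominal).
basket_matrix = pivot.fillna(0).astype(bool)
```

The dense result of that pivot is exactly the matrix that eats the memory, which is what question 2 below is about.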
Now on to my problem (or problems):
Even if I disable or remove FP-Growth, I get an error message during processing of the Pivot step saying that the process would require too much memory and will quit. This happens as soon as I set the sample size to at least 100,000. 10,000 runs fine, but the last step, “Numerical to Binomial”, already takes ages to complete.
I use a desktop Windows 7 x64 PC with 4 GB of memory. During processing, the memory usage of RapidMiner rises to 2.5-3 GB, so I cannot assign it any more.
1) Is it expected behaviour that ‘Numerical to Binomial’ takes far longer than any of the other operators? Might there be a more efficient way? For a sample size of 100,000 the whole process took 16 minutes; probably 14 of those were spent in this one step.
2) I suspect that creating the pivot table is a huge memory burner. As there are quite a lot of baskets, this results in a huge matrix with many attributes. Is there a different way to do this? Or another, less memory-consuming data format that FP-Growth would accept?
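To put a rough number on that suspicion (the count of distinct names is only a guess, I don't know the real one): a dense pivot result of doubles needs roughly baskets × items × 8 bytes, e.g.:

```python
# Back-of-the-envelope size of the dense pivot result.
baskets = 100_000          # the sample size at which the error appears
distinct_names = 5_000     # hypothetical; plug in the real number of names
bytes_per_double = 8

dense_bytes = baskets * distinct_names * bytes_per_double
print(f"~{dense_bytes / 1024**3:.1f} GiB")   # ~3.7 GiB for these numbers
```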
Any help or advice would be greatly appreciated. Thanks!
Answers
Did you actually try to use "create_view" for the Nominal to Binominal operator? It will save a giant amount of memory if you have lots of nominal values.
Your resulting data set will be very sparse, so it could be worth a try to pivot your data table in batches and append all these batches to a sparse example set.
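To show the shape of that batched, sparse idea, here is a minimal sketch outside RapidMiner (pandas/scipy are assumptions here, as are the column names Basket and name carried over from the sketch above): pivot one batch of baskets at a time into a sparse 0/1 block and stack the blocks.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def pivot_in_batches(df, batch_size=100_000):
    """Turn Basket/name pairs into one sparse 0/1 basket-by-item matrix."""
    df = df.drop_duplicates(["Basket", "name"])        # one entry per pair
    items = pd.Index(sorted(df["name"].unique()))      # fixed item vocabulary
    baskets = df["Basket"].unique()
    blocks = []
    for start in range(0, len(baskets), batch_size):
        keys = baskets[start:start + batch_size]
        batch = df[df["Basket"].isin(keys)]
        rows = pd.Categorical(batch["Basket"], categories=keys).codes
        cols = items.get_indexer(batch["name"])
        data = np.ones(len(batch), dtype=np.int8)
        blocks.append(sparse.csr_matrix((data, (rows, cols)),
                                        shape=(len(keys), len(items))))
    return sparse.vstack(blocks), baskets, items
```

Whether FP-Growth will accept such a sparse set directly is a separate question, so treat this only as an illustration of the batching idea, not a drop-in replacement for the process.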
Greetings,
Sebastian
Thanks for your reply! I tried that, but if I remember correctly it didn't make a difference in run time or in the amount of data at which it threw the memory error. To be 100% sure on this one, I will have to recheck this evening. Could you elaborate on the batching idea? Are there operators in RapidMiner for this that you could point me to? Or would I need to split and combine the data "by hand"?
What should the recombined data look like in your case? Many smaller tables with fewer attributes than the original? I am not sure I understand what your plan is.
Thanks!