Using RapidMiner for building earth science models from satellite data

bbonnlander · Member · Posts: 1 · Contributor I
edited November 2018 in Help

Hello RapidMiner developers and users,

My name is Brian Bonnlander, and I'm a research scientist with experience building earth science and ecological forecasting models from satellite data. I'm very interested in finding or helping develop a freely available data mining toolkit that can be used by earth scientists to explore and build forecasting models from large sets of gridded data, including satellite and ground observation data. Currently, much of the work in ecological forecasting is done through partnerships between earth scientists, who have the domain knowledge for ecological forecasting, and computer scientists (like myself), who write the code for data preprocessing and forecasting. The problem is that much of the code is built using proprietary tools such as Matlab, and these solutions are hard for earth scientists to understand, extend, and share with other scientists, partly because they are not coders, and partly because the supporting languages are not freely available.

It is my belief that research in earth science would be greatly enhanced if earth scientists could explore data themselves with an easy-to-use set of tools, and share their code and results for other scientists to build upon. There are grant opportunities within the U.S. for developing such tools, and I am interested in writing a grant proposal for extending a toolkit such as RapidMiner for processing large gridded datasets.

Please correct me if my assumption is incorrect, but the piece that is currently missing from RapidMiner is not necessarily the ability to handle large datasets, which can run into the tens of GB for earth science models, but rather functionality for processing data with a gridded structure. For example, a common preprocessing operation with gridded data involves spatial or temporal smoothing. Suppose that every data point is labeled with an (X,Y) location and a time T. Then a commonly used preprocessing operation would involve smoothing values for every location (X,Y) over a time window of (T-10, T+10), or smoothing values at time T over a two-dimensional neighborhood of values around (X,Y).
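For concreteness, the two smoothing operations described above can be sketched in Python with NumPy/SciPy. This is only a minimal illustration of the idea, not an existing RapidMiner operator; the array layout (a 3-D array indexed as [T, Y, X]) and the window sizes are my assumptions.

```python
# Sketch of temporal and spatial smoothing on gridded data.
# Assumes `data` is a 3-D array indexed as [T, Y, X] (an assumption,
# not a RapidMiner convention).
import numpy as np
from scipy.ndimage import uniform_filter, uniform_filter1d

def temporal_smooth(data, half_window=10):
    """Smooth each (X, Y) location over the window (T-10, T+10)."""
    return uniform_filter1d(data, size=2 * half_window + 1, axis=0,
                            mode="nearest")

def spatial_smooth(data, radius=1):
    """Smooth each time slice over a 2-D neighborhood around (X, Y)."""
    return uniform_filter(data, size=(1, 2 * radius + 1, 2 * radius + 1),
                          mode="nearest")

# Tiny demonstration grid: 30 time steps over a 5x5 spatial grid.
grid = np.random.default_rng(0).random((30, 5, 5))
smoothed_t = temporal_smooth(grid)
smoothed_s = spatial_smooth(grid)
```

Both operations preserve the grid shape, so the smoothed values can feed directly into the later modeling steps.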

Once the data are preprocessed in these ways, they are often treated as standard training examples for machine learning. The only other step I've often performed is dividing the examples into separate training sets based on some categorical attribute (such as the landcover type at location (X,Y)), and training separate models for each category.
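A minimal sketch of this split-by-category training, assuming scikit-learn; the category labels and the linear model are illustrative stand-ins for whatever attribute and learner would actually be used:

```python
# Train one model per value of a categorical attribute (e.g. landcover
# type), then dispatch predictions to the matching model.
import numpy as np
from sklearn.linear_model import LinearRegression

def train_per_category(X, y, categories):
    """Fit a separate model for each distinct category label."""
    models = {}
    for cat in np.unique(categories):
        mask = categories == cat
        models[cat] = LinearRegression().fit(X[mask], y[mask])
    return models

def predict_per_category(models, X, categories):
    """Route each example to the model trained on its category."""
    preds = np.empty(len(X))
    for cat, model in models.items():
        mask = categories == cat
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds

# Synthetic demonstration data with two hypothetical landcover types.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
cats = rng.choice(["forest", "grassland"], size=100)
y = X @ np.array([1.0, 2.0, 3.0]) + (cats == "forest")
models = train_per_category(X, y, cats)
preds = predict_per_category(models, X, cats)
```

Because each model only ever sees and predicts examples from its own category, no explicit model combination is needed for local predictions.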


So my questions are the following:

1. Would it be difficult to add this kind of functionality to RapidMiner (if it does not already exist)?
2. Is anyone aware of past efforts to use RapidMiner for this type of earth science research?
3. Are there any geodata formats, such as GeoTiff, HDF, or NetCDF, that are already supported by RapidMiner?
4. Would it be feasible for one or two full-time developers working for about a year to add support for these types of data and data operations?

I apologize if these questions are too general, but I have failed to find answers to these questions through internet search. I very much look forward to any replies.

Thank you!

--Brian




Answers

  • IngoRM · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor · Posts: 1,751 · RM Founder
    Hi Brian,

    It is my belief that research in earth science would be greatly enhanced if earth scientists could explore data themselves with an easy-to-use set of tools, and share their code and results for other scientists to build upon. There are grant opportunities within the U.S. for developing such tools, and I am interested in writing a grant proposal for extending a toolkit such as RapidMiner for processing large gridded datasets.
    Sounds great! I am sure there are developers in the US who know RapidMiner and come from universities that would be interested in such a proposal. We would be interested ourselves, but we have no base in the US right now...

    Please correct me if my assumption is incorrect, but the piece that is currently missing from RapidMiner is not necessarily the ability to handle large datasets, which can run into the tens of GB for earth science models, but rather functionality for processing data with a gridded structure. For example, a common preprocessing operation with gridded data involves spatial or temporal smoothing. Suppose that every data point is labeled with an (X,Y) location and a time T. Then a commonly used preprocessing operation would involve smoothing values for every location (X,Y) over a time window of (T-10, T+10), or smoothing values at time T over a two-dimensional neighborhood of values around (X,Y).
    You are right. With certain models, it is possible to work directly on a database, and then there is in principle no restriction on the maximum amount of data. However, the computation time will of course keep growing, and the work should definitely be divided among different computers if that is possible in the way you have described.

    Once the data are preprocessed in these ways, they are often treated as standard training examples for machine learning. The only other step I've often performed is dividing the examples into separate training sets based on some categorical attribute (such as the landcover type at location (X,Y)), and training separate models for each category.
    But then the question arises: how are the models combined? It may of course be that such a combination is not desired anyway because one is only interested in local predictions.
    1. Would it be difficult to add this kind of functionality to RapidMiner (if it does not already exist)?
    No, it is not that hard. We had a small student project here for some months that came up with a distributed version of RapidMiner. We found that it was not production-ready yet and decided to start again from scratch, but the project has at least shown that it is possible to distribute the tasks among a grid and bring back and combine the results.

    2. Is anyone aware of past efforts to use RapidMiner for this type of earth science research?
    I don't know of any, but I know people from the University of Bonn, Germany, who work on geo mining. If I remember correctly, one of the names there was Till Rumpf.

    3. Are there any geodata formats, such as GeoTiff, HDF, or NetCDF, that are already supported by RapidMiner?
    Maybe, but I am not aware of any right now.

    4. Would it be feasible for one or two full-time developers working for about a year to add support for these types of data and data operations?
    For the data: definitely yes. For the operations: this might depend on how familiar the developers are with distributed computing / mining or, even better, with RapidMiner. But it is probably possible to come up with a system that is at least post-alpha after one year.

    I apologize if these questions are too general, but I have failed to find answers to these questions through internet search. I very much look forward to any replies.
    No need to apologize. I always find it interesting to learn what people are interested in and in which fields RapidMiner is used. I hope that my answers help at least a little bit...

    Cheers,
    Ingo
  • jdouet · Member · Posts: 19 · Maven
    Hi All,

    Maybe this will answer your computation and preprocessing questions:
    http://www.inf.ufrgs.br/~vbogorny/software.html

    Cheers,
    Jean-Charles.
  • ThomasM · Member · Posts: 3 · Contributor I
    Hi All,

    Would you see an interest in implementing a "spherical harmonics" operator in the "feature generation" category, a bit like "wavelets" in "time series"? They are so useful in earth topography, seismology, etc...
    Whenever you have a physical field F satisfying "Delta / Laplacian F = 0" in spherical coordinates, it has been shown that a kind of "Fourier analysis" of F can be performed.
    The vector basis for this analysis has a double index and is said to be orthonormal because it is built from Legendre polynomials with a specific orthonormalization process. Thus, if you have a "gridded" set of F values, you could compute many components of the model, just by telling the operator which columns hold F, rho, theta, and phi.
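As a rough illustration of such an operator (this is not existing RapidMiner functionality), the components c_{l,m} of a gridded field can be computed against spherical harmonics built from associated Legendre polynomials. The grid resolution, the test field, and the chosen (l, m) indices below are arbitrary:

```python
# Project a gridded field F(theta, phi) onto orthonormal spherical
# harmonics Y_l^m built from associated Legendre polynomials.
import numpy as np
from math import factorial
from scipy.special import lpmv

def sph_harm_lm(l, m, theta, phi):
    """Orthonormal Y_l^m (theta = polar angle, phi = azimuth)."""
    norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                   * factorial(l - m) / factorial(l + m))
    return norm * lpmv(m, l, np.cos(theta)) * np.exp(1j * m * phi)

def sph_coefficient(F, l, m, theta, phi):
    """Approximate c_{l,m} = integral of F * conj(Y_l^m) over the sphere."""
    dtheta = theta[1, 0] - theta[0, 0]
    dphi = phi[0, 1] - phi[0, 0]
    integrand = F * np.conj(sph_harm_lm(l, m, theta, phi)) * np.sin(theta)
    return integrand.sum() * dtheta * dphi

# Midpoint grid over the sphere; F is chosen as Y_1^0 itself, so the
# recovered coefficient c_{1,0} should be close to 1 by orthonormality.
n = 200
th = (np.arange(n) + 0.5) * np.pi / n
ph = (np.arange(2 * n) + 0.5) * 2 * np.pi / (2 * n)
theta, phi = np.meshgrid(th, ph, indexing="ij")
F = sph_harm_lm(1, 0, theta, phi).real
c10 = sph_coefficient(F, 1, 0, theta, phi)
```

By orthonormality, projecting the same field onto a different index, e.g. (0, 0), yields a coefficient near zero.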

    Thomas.
  • RAPID · Member · Posts: 1 · Contributor I
    1. The HDF5 plugin works for HDF5 files, but not for older files in NetCDF-4 / HDF4; I use HDFView for those, albeit tediously.

    Any suggestions for how to do this for older NetCDF-4 and NetCDF-3 time series? Preferably in cloud instances.
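One possible workaround for the NetCDF-3 case, sketched with SciPy's built-in NetCDF-3 reader: flatten the gridded variable into one row per grid cell, which a plain CSV import can then pick up. The file name, variable name, and dimensions below are made up for the demonstration (a tiny file is created first so the sketch is self-contained).

```python
# Flatten a NetCDF-3 variable into (t, y, x, value) CSV rows.
import csv
import numpy as np
from scipy.io import netcdf_file

# Create a tiny NetCDF-3 file for the demo (normally it already exists).
nc = netcdf_file("demo.nc", "w")
nc.createDimension("t", 2)
nc.createDimension("y", 3)
nc.createDimension("x", 4)
var = nc.createVariable("ndvi", "f4", ("t", "y", "x"))
var[:] = np.arange(24, dtype="f4").reshape(2, 3, 4)
nc.close()

# Read it back (mmap=False so the data survives closing the file).
nc = netcdf_file("demo.nc", "r", mmap=False)
data = nc.variables["ndvi"][:].copy()
nc.close()

# Write one CSV row per grid cell.
with open("demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["t", "y", "x", "ndvi"])
    for (t, y, x), value in np.ndenumerate(data):
        writer.writerow([t, y, x, float(value)])
```

Note that this route only covers the NetCDF-3 format; NetCDF-4 files are HDF5-based and would need a different reader.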

    2. Direct conversion to ARFF and Weka is possible:
    https://github.com/fracpete/netcdf-converters-weka-package
    It compiles, but I cannot make it work in the latest RapidMiner versions. Any suggestions? Perhaps the .jars are outdated?

    Sincerely Rapdio