NiFi, short for Niagara Files, is a flow-based programming solution originally developed by the NSA. It was open-sourced in 2014 and is commercially supported. More than an ETL solution, it has the power to transform the way your business processes and consumes data.
Reason #1: Let NiFi do some Heavy Lifting before the data hits the database
Using stored procedures to translate and process data is a common approach. At scale, it’s one of the more costly ETL methods. Why? Because storing processing logic in a database requires the entire database to scale. Including the licenses.
Among other things, NiFi can:
- Ensure a file’s schema
- Look up a value from another source based on an input field
- Translate to another format
- Support many input formats and methods
- Scale to your heart’s content
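To make the first few bullets concrete: in NiFi you'd wire record-oriented processors together, but the underlying idea is simple enough to sketch in plain Python. The field names and expected schema below are made up for illustration; this validates each CSV row against an expected schema and translates it to JSON.

```python
import csv
import io
import json

# Hypothetical schema for illustration only.
EXPECTED_FIELDS = ["id", "name", "amount"]

def csv_to_json(text: str) -> list[str]:
    """Check the file's header against the expected schema, then emit
    each row as a JSON record (schema check + format translation)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"unexpected schema: {reader.fieldnames}")
    return [json.dumps(row) for row in reader]

records = csv_to_json("id,name,amount\n1,widget,9.99\n2,gadget,4.50\n")
```

In NiFi, each of those steps would be its own processor in the flow rather than lines in one function, which is what makes the pipeline visible and configurable.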
If you haven’t had a chance to tinker with NiFi’s functionality, it’s easy enough. Here’s a tutorial to help get you started.
Use Docker to pull and run the default container from Docker Hub and you'll be up and running in minutes. Here's the command to start NiFi in a Docker environment:
docker run --rm -p 8080:8080 -p 8181:8181 webcenter/alpine-nifi
Give it a minute or two, then open: http://localhost:8080/nifi.
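If you prefer Docker Compose, the same container can be expressed as a compose file. This is a sketch based on the `docker run` command above; adjust the ports and image to taste:

```yaml
# docker-compose.yml — equivalent of the docker run command above
services:
  nifi:
    image: webcenter/alpine-nifi
    ports:
      - "8080:8080"   # NiFi web UI
      - "8181:8181"
```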
NiFi has a vast number of processor types that provide various processing capabilities. These include processors to:
- Retrieve files through various methods
- Run database queries
- Call an API
- Wait for input
- …and hundreds more.
Play around with them and you’ll discover how powerful NiFi is.
It’s also very flexible. And if you don’t see a processor that fits your needs, it’s very extensible — there are many online examples of how to build a custom NiFi processor.
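Custom processors are written in Java against NiFi's processor API, but the shape of the abstraction is easy to sketch. Conceptually, a processor takes a flow file's content, transforms it, and routes the result to a named relationship such as success or failure. Everything below is an illustrative analogy in plain Python, not NiFi's actual API:

```python
def uppercase_processor(flowfile: dict) -> tuple[str, dict]:
    """Toy 'processor': transform the content, then route the flow file
    to a relationship ('success' or 'failure') for downstream processors."""
    try:
        transformed = {**flowfile, "content": flowfile["content"].upper()}
        return "success", transformed
    except (KeyError, AttributeError):
        # Bad input: route to failure so the flow can handle it separately.
        return "failure", flowfile

relationship, out = uppercase_processor(
    {"content": "hello nifi", "attrs": {"filename": "a.txt"}}
)
```

The routing is the important part: in NiFi, failure handling is wired into the flow itself instead of buried in try/catch blocks.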
Reason #2: Move from Coding to Configuration
One of the biggest advantages of moving to NiFi is that it migrates code to configuration. No more modifying stored procedures. No more having to get a developer’s time to make a change.
NiFi changes are simple by comparison. Click on a processor, set its properties, and things change the next time data passes through.
To be sure, some configuration options are still complex. In short order, you’ll discover that processing data in NiFi is different. You’ll want to work through a proof of concept or prototype and figure out what works best. A couple of examples:
- When reading a CSV file, should you process all records at once or break them into individual flow files?
- For field lookup in a secondary table, should you use an API or build a local cache processor?
One thing NiFi has in common with other types of development: there are many ways to do the same thing. Try different approaches to find the best solution for your unique environment.
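For the lookup question, the tradeoff is between a network call per record and a process-local cache. As a rough illustration — the `fetch_from_api` function here is a hypothetical stand-in for whatever remote lookup you'd actually do:

```python
import functools

def fetch_from_api(key: str) -> str:
    # Hypothetical remote lookup -- in reality an HTTP call or DB query.
    return f"value-for-{key}"

@functools.lru_cache(maxsize=10_000)
def lookup(key: str) -> str:
    """First call for a key hits the 'API'; repeats are served locally."""
    return fetch_from_api(key)

lookup("abc")
lookup("abc")          # served from the cache, no remote call
hits = lookup.cache_info().hits
```

Whether the cache wins depends on how often keys repeat and how stale a cached value is allowed to be — exactly the kind of question a quick prototype answers faster than a debate.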
NiFi flows can get quite complex. I recommend grouping processors together into logical subsets. It’s also possible to export groups to XML and import them into another NiFi instance.
Reason #3: NiFi and Efficient Scaling
NiFi can run on a single node or on many nodes managed by ZooKeeper. Several blogs report that Apache is working on a way to deploy NiFi using Kubernetes. It’s possible now, but it’s a little more complex due to the ZooKeeper requirement.
One of the most impressive things about NiFi is its speed. I have to admit — it was stunning when I first saw how fast data passed through each processor.
I worked with a client that needed to process about 5 MB of data per minute. They had been using SQL Server stored procedures that took over 30 minutes to process a large input file.
After implementing NiFi, that same file processed in less than two minutes. What’s even more amazing is that the initial comparison was between a production SQL Server environment and NiFi running on a developer’s laptop.
Like I said, it was stunning.
But it isn’t a complete walk in the park. In part two of this series, I’ll explore some of the differences between traditional programming models and flow-based programming.
If you’ve used NiFi, let me know what you think in the comments below!