Top 5 Questions about Apache NiFi
Over the last few weeks, I delivered four live NiFi demo sessions, showing how to use NiFi connectors and processors to connect to various systems, with 1000 attendees in different geographic regions. I want to thank you all for joining and attending these events! Interactive demo sessions and live Q&A are what we all need these days, now that working remotely from home is the norm. If you have not seen my live demo session, you can catch up by watching it here.
I received hundreds of questions during these events, and my colleagues and I tried to answer as many as we could. As promised, here are my answers to some of the most frequently asked questions.
1. What is the difference between MiNiFi and NiFi?
MiNiFi is a set of agents used to collect subsets of data from sensors and devices situated in remote locations. The goal is to help with the “first mile collection” of the data and to acquire the data as close as possible to its source.
These devices can be servers, workstations, and laptops, but also sensors, self-driving cars, machines in factories, and so on: anywhere you want to collect specific data using some of the NiFi features within MiNiFi, such as the ability to filter, select, and triage the data before sending it to a destination. The objective of MiNiFi is to manage this entire process at scale with Edge Flow Manager so the Operations or IT teams can deploy different flow definitions and collect any data as the business requires. Here are some details to consider:
- NiFi is designed to be centrally located, usually in a data center or in the cloud, to move data around or collect data from well-known external systems like databases, object stores, etc. NiFi should be seen as the gateway to move data back and forth between heterogeneous environments or in a hybrid cloud architecture.
- MiNiFi operates locally on a host, does some computation and logic, and only sends the data you care about to external systems for data distribution. Such systems can be NiFi, of course, but also MQTT brokers, cloud provider services, etc. MiNiFi also supports use cases where network bandwidth is limited and the volume of data sent over the network needs to be reduced.
- MiNiFi comes in two versions: C++ and Java. The MiNiFi C++ option has a very small footprint (a few MBs of memory, little CPU) but has a smaller set of processors available. The MiNiFi Java option is a lightweight single-node instance, a headless version of NiFi without the user interface or the clustering capabilities. Still, it requires Java to be available on the host.
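To make the "filter, select, and triage" idea concrete, here is a minimal sketch in plain Python. It is not MiNiFi code and does not use any MiNiFi API; the Reading type, the triage function, and the 80 °C threshold are all hypothetical names and values chosen for illustration. It only shows the principle of dropping uninteresting data on the host before anything goes over the network.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Reading:
    """A hypothetical sensor reading captured on the edge device."""
    sensor_id: str
    temperature_c: float


def triage(readings: Iterable[Reading], threshold_c: float) -> List[Reading]:
    """Keep only readings at or above the threshold (an assumed rule)."""
    return [r for r in readings if r.temperature_c >= threshold_c]


# Four readings collected locally; only the hot ones are worth sending.
readings = [
    Reading("s1", 21.5),
    Reading("s2", 85.0),
    Reading("s3", 19.0),
    Reading("s4", 90.2),
]
to_send = triage(readings, threshold_c=80.0)
print([r.sensor_id for r in to_send])  # prints ['s2', 's4']
```

In a real deployment the same decision would be expressed as a flow definition pushed to the agents, so the rule can change without touching the devices.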
2. Why use NiFi when one can use Kafka as an entry point to the cluster?
This is a great question, and many who attended my Live NiFi Demo Jam asked it. Here are ways you can determine when to use NiFi and when to use Kafka.
- Kafka is designed for stream-oriented use cases with small messages, and ingesting large files through it is not a good idea. NiFi is data size agnostic: it handles small events and large files alike.
- Kafka is like a mailbox that stores the data in Kafka Topics, waiting for an application to publish and/or consume it. NiFi is like the mailman who delivers the data to the mailbox or a different destination.
- NiFi offers a wide range of protocols (MQTT, Kafka Protocol, HTTP, Syslog, JDBC, TCP/UDP, and more) for ingesting data, which makes it a consistent, single piece of software for managing all your data ingestion. You may want to consider sending data to Kafka for multiple downstream applications. However, NiFi should be the gateway to get the data because it supports a wide range of protocols and lets you implement data requirements in the same easy drag-and-drop interface, making the ROI very high.
- Use NiFi to move data securely to multiple locations, especially with a multi-cloud strategy.
- Kafka Connect can address some of these needs, but it is not a universal solution when you require complex filtering, routing, enrichment, and transformations when moving data.
- NiFi is also built on top of an extensible framework which provides easy ways for users to extend NiFi’s capabilities and quickly build very custom data movement flows.
3. What is the best way to expose a REST API for real-time data collection at scale?
Our customers use NiFi to expose REST APIs for external sources to send data to a destination. The most common protocol is HTTP.
- If your goal is to ingest data, you will use the ListenHTTP processor in NiFi, have it listen on a given port for HTTP requests, and you can send any data to it.
- If you want to provide a web service using NiFi, look at the HandleHTTPRequest and HandleHTTPResponse processors. By using the combination of the two processors, you will receive a request over HTTP from an external client. You will be able to act on the data in the request and send back a custom answer/result to the client. For example, you can use NiFi to give clients access to external systems like an FTP server over HTTP. You would use the two processors to receive the request over HTTP. When NiFi receives the query, it queries the FTP server to get the file, and the file is sent back to the client.
- All of these use cases can scale very well with NiFi. NiFi would scale horizontally based on the requirements, and a load balancer would be set in front of the NiFi instances to balance the load across the NiFi nodes in the cluster.
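From the client side, sending data to a ListenHTTP processor is just an HTTP POST. The sketch below uses only the Python standard library. The host nifi.example.com, port 8081, and the build_ingest_request helper are hypothetical, and the /contentListener path assumes the processor's Base Path property was left at that value; adjust all of them to match your own processor configuration.

```python
import json
import urllib.request


def build_ingest_request(base_url: str, record: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a ListenHTTP endpoint."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint: host, port, and path depend on your ListenHTTP setup.
NIFI_URL = "http://nifi.example.com:8081/contentListener"
request = build_ingest_request(NIFI_URL, {"sensor_id": "s2", "temperature_c": 85.0})

# To actually send it (requires a reachable ListenHTTP processor):
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```

When a load balancer sits in front of the cluster, the client would point at the load balancer address instead of an individual node.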
4. Can a NiFi data flow be blocked or shared based on users’ access and security policy?
NiFi provides a very fine-grained multi-tenancy and policy model. It is easy to set up the right policies to provide NiFi in a multi-tenant environment. You can easily have multiple process groups defined in NiFi with different sets of policies, so you have a dedicated process group for team A working on use case 1 and a dedicated process group for team B working on use case 2. Here are some things to consider:
- NiFi ensures that teams do not have access to each other’s process groups. This is easy to set up using Apache Ranger or the internal policies within NiFi. You can have multiple teams working on numerous use cases in the same NiFi environment.
- Within a NiFi cluster, all the resources are shared by all the existing flows, and there is no resource isolation. For example, NiFi cannot allocate 60% of the resources for use case #1 and 40% of the resources for use case #2. For critical use cases, most customers will have a dedicated NiFi cluster to ensure SLAs are met. NiFi provides monitoring features so you can check that resources are correctly used within the cluster and be alerted if the cluster is undersized.
- In 2021, Cloudera will release a new solution, allowing customers to run NiFi flows in perfectly sized, dedicated NiFi clusters, running on k8s with auto-scaling (up and down). This option ensures each use case uses what is required over time while not impacting other use cases.
5. Is NiFi a good replacement for ETL and batch processing?
NiFi can certainly replace ETL for certain use cases and can also be used for batch processing. However, one should consider the kind of processing/transformation required by the use case. In NiFi, FlowFiles are the way to describe events, objects, and data going through the flow. While you can execute any transformation per FlowFile in NiFi, you likely don’t want to use NiFi to join FlowFiles together based on a common column or do some types of windowing aggregations. In such a case, Cloudera recommends using additional solutions.
So what are the recommendations?
- In a streaming use case, the best option is to have the records sent to one or many Kafka topics using the record processors in NiFi. You can then have Flink do all the processing you want on this data (joining streams or doing windowing operations) using Continuous SQL, which comes from our acquisition of Eventador.
- In a batch use case, you would treat NiFi as an ELT rather than an ETL (E = extract, T = transform, L = load). NiFi would capture the various datasets, do the required transformations (schema validation, format transformation, data cleansing, etc.) on each dataset, and send the datasets to a data warehouse powered by Hive. Once the data is there, NiFi could trigger a Hive query to perform the join operation.
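The split between per-record work and set-based work in the batch recommendation can be sketched as follows. This is plain Python with an in-memory SQLite database standing in for the Hive warehouse, purely so the example runs anywhere; the table names, the clean function, and the sample rows are all made up for illustration. The point is only that cleansing happens record by record before loading, while the join runs as a query inside the warehouse.

```python
import sqlite3


def clean(record: dict) -> dict:
    """Per-record cleansing: cast the id, trim and normalize the name."""
    return {
        "customer_id": int(record["customer_id"]),
        "name": record["name"].strip().title(),
    }


# Raw datasets as they might arrive from two different sources (made up).
raw_customers = [
    {"customer_id": "1", "name": "  alice "},
    {"customer_id": "2", "name": "BOB"},
]
raw_orders = [(101, 1, 25.0), (102, 1, 10.0), (103, 2, 7.5)]

# SQLite stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")

# Load step: each dataset is cleaned record by record, then loaded.
conn.executemany(
    "INSERT INTO customers VALUES (:customer_id, :name)",
    [clean(r) for r in raw_customers],
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", raw_orders)

# The join is left to the warehouse and triggered as a query after loading.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)  # prints [('Alice', 35.0), ('Bob', 7.5)]
```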
I hope these answers help you on your data journey as you determine how to use NiFi and the benefits it can deliver for your business needs. We’ll be hosting more live demos with Q&A sessions to cover specific topics like monitoring NiFi flows and how to automate flow deployment with NiFi. In fact, we had many questions on NiFi that deserve their own sessions!
For the upcoming webinar, please join me and register for the Live NiFi Demo Jam Returns on January 21, 2021, at 10AM PST/1:00PM EST, for another interactive no-slide meeting and the opportunity for you to ask more Apache NiFi questions.
I’m looking forward to seeing you all at these events! Happy NiFi’ing and see you soon!