Apache NiFi Data Flow
Apache NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows. It is highly configurable along several dimensions of quality of service, such as loss-tolerant versus guaranteed delivery, low latency versus high throughput, and priority-based queuing. NiFi provides fine-grained data provenance for all data received, forked, joined, cloned, modified, sent, and ultimately dropped upon reaching its configured end-state.
Put simply, NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data.
This blog post describes a simple dataflow from the local file system to HDFS using Apache NiFi.
Before we get into the exercise, we should be aware of the basic components of Apache NiFi.
NiFi provides several extension points to provide developers the ability to add functionality to the application to meet their needs. The following list provides a high-level description of the most common extension points:
The Processor interface is the mechanism through which NiFi exposes access to FlowFiles, their attributes, and their content. The Processor is the basic building block used to comprise a NiFi dataflow. This interface is used to accomplish all of the following tasks:
Read FlowFile content
Write FlowFile content
Read FlowFile attributes
Update FlowFile attributes
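To make the four tasks above concrete, here is a minimal Python sketch. It is a toy model, not NiFi's API (NiFi Processors are written against a Java interface); the `FlowFile` class, the helper functions, and the `details.txt` file name are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class FlowFile:
    """Toy stand-in for a FlowFile: a set of attributes plus content bytes."""
    attributes: dict = field(default_factory=dict)
    content: bytes = b""


def read_content(flow_file):
    """Read FlowFile content."""
    return flow_file.content


def write_content(flow_file, new_content):
    """Write FlowFile content; returns a new object, the input is untouched."""
    return replace(flow_file, content=new_content)


def read_attribute(flow_file, name):
    """Read a FlowFile attribute (None when it is not set)."""
    return flow_file.attributes.get(name)


def update_attribute(flow_file, name, value):
    """Update a FlowFile attribute on a copy of the attribute map."""
    return replace(flow_file, attributes={**flow_file.attributes, name: value})


# Hypothetical example data.
original = FlowFile(attributes={"filename": "details.txt"}, content=b"hello")
upper = write_content(original, read_content(original).upper())
tagged = update_attribute(upper, "processed", "true")
```

Each helper returns a new object instead of mutating its input, which keeps the sketch easy to reason about; that is a design choice of this example only.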
The ReportingTask interface is a mechanism that NiFi exposes to allow metrics, monitoring information, and internal NiFi state to be published to external endpoints, such as log files, e-mail, and remote web services.
A ControllerService provides shared state and functionality across Processors, other ControllerServices, and ReportingTasks within a single JVM. An example use case may include loading a very large dataset into memory. By performing this work in a ControllerService, the data can be loaded once and be exposed to all Processors via this service, rather than requiring many different Processors to load the dataset themselves.
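The load-once idea can be sketched in a few lines of Python. This is only an illustration of the sharing pattern, not NiFi's ControllerService API; `LookupService`, `EnrichProcessor`, and the tiny country-code dataset are hypothetical, and the sketch ignores thread safety.

```python
class LookupService:
    """Hypothetical shared service: loads a dataset once, serves many callers."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0  # Only here so we can observe how often we load.

    def lookup(self, key):
        # Load lazily on first use, then reuse the in-memory copy.
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data.get(key)


def load_country_codes():
    # Stand-in for an expensive load of a very large dataset.
    return {"IN": "India", "US": "United States"}


class EnrichProcessor:
    """Hypothetical processor that relies on the shared service."""

    def __init__(self, service):
        self.service = service

    def process(self, code):
        return self.service.lookup(code)


service = LookupService(load_country_codes)
first = EnrichProcessor(service)
second = EnrichProcessor(service)
results = [first.process("IN"), second.process("US"), first.process("XX")]
```

Both processors share one service instance, so the dataset is loaded a single time no matter how many processors use it.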
The FlowFilePrioritizer interface provides a mechanism by which FlowFiles in a queue can be prioritized, or sorted, so that the FlowFiles can be processed in an order that is most effective for a particular use case.
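As a rough illustration of prioritizing a queue, the Python sketch below sorts FlowFiles by a numeric `priority` attribute. The attribute name, the "lower number first" rule, and the choice to put FlowFiles without a usable priority last are assumptions made for this example, and plain dictionaries stand in for FlowFile attributes.

```python
def priority_key(attributes):
    """Sort key: lower numeric 'priority' first, missing or invalid ones last."""
    raw = attributes.get("priority")
    try:
        return (0, int(raw))
    except (TypeError, ValueError):
        # No priority, or not a number: sort after every prioritized FlowFile.
        return (1, 0)


# Hypothetical queue contents.
queue = [
    {"filename": "c.txt", "priority": "5"},
    {"filename": "a.txt"},
    {"filename": "b.txt", "priority": "1"},
]

ordered = sorted(queue, key=priority_key)
names = [attributes["filename"] for attributes in ordered]
print(names)
```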
An AuthorityProvider is responsible for determining which privileges and roles, if any, a given user should be granted.
Steps to work through:
1. Log in to NiFi
2. Drag a GetFile Processor onto the canvas
3. Drag a PutHDFS Processor onto the canvas
4. Configure the GetFile Processor
5. Configure the PutHDFS Processor
6. Provide the required input file path
7. Make a connection
8. Start the processors
9. Check the output file location
Clicking the login link will open the login page. Users logging in with a username and password are presented with a form to do so.
Building a DataFlow:
A Dataflow Manager (DFM) is able to build an automated dataflow using the NiFi UI. Simply drag components from the toolbar to the canvas, configure the components to meet specific needs, and connect the components together.
Adding Components to the Canvas
The Processor is the most commonly used component, as it is responsible for data ingress, egress, routing, and manipulating. There are many different types of Processors. In fact, this is a very common Extension Point in NiFi, meaning that many vendors may implement their own Processors to perform whatever functions are necessary for their use case. When a Processor is dragged onto the canvas, the user is presented with a dialog to choose which type of Processor to use:
Clicking the Add button or double-clicking on a GetFile Processor Type will add the selected Processor to the canvas at the location that it was dropped.
Clicking the Add button or double-clicking on a PutHDFS Processor Type will add the selected Processor to the canvas at the location that it was dropped.
Once you have dragged a Processor onto the canvas, you can interact with it by right-clicking on the Processor and selecting an option from the context menu.
Configure:
This option allows the user to establish or change the configuration of the Processor.
Start or Stop:
This option allows the user to start or stop a Processor; the option will be either Start or Stop, depending on the current state of the Processor.
Stats:
This option opens a graphical representation of the Processor’s statistical information over time.
Data provenance:
This option displays the NiFi Data Provenance table, with information about data provenance events for the FlowFiles routed through that Processor.
Usage:
This option takes the user to the Processor’s usage documentation.
Change color:
This option allows the user to change the color of the Processor, which can make the visual management of large flows easier.
Center in view:
This option centers the view of the canvas on the given Processor.
Copy:
This option places a copy of the selected Processor on the clipboard, so that it may be pasted elsewhere on the canvas by right-clicking on the canvas and selecting Paste. The Copy/Paste actions also may be done using the keystrokes Ctrl-C (Command-C) and Ctrl-V (Command-V).
Delete:
This option allows the DFM to delete a Processor from the canvas.
Configuring a GetFile Processor:
To configure a processor, right-click on the Processor and select the Configure option from the context menu. The configuration dialog is opened with four different tabs, each of which is discussed below. Once you have finished configuring the Processor, you can apply the changes by clicking the Apply button or cancel all changes by clicking the Cancel button.
Note that after a Processor has been started, the context menu shown for the Processor no longer has a Configure option but rather has a View Configuration option. Processor configuration cannot be changed while the Processor is running. You must first stop the Processor and wait for all of its active tasks to complete before configuring the Processor again.
The first tab in the Processor Configuration dialog is the Settings tab. It contains several different configuration items. First, it allows the DFM to change the name of the Processor. The name of a Processor by default is the same as the Processor type. This tab also includes other configuration settings such as auto-terminated relationships, penalty duration, yield duration, and bulletin level (refer to the Apache NiFi User Guide for more detail).
The second tab in the Processor Configuration dialog is the Scheduling tab, which offers the different scheduling strategies: Timer driven, Event driven, and CRON driven.
The Properties Tab provides a mechanism to configure Processor-specific behavior. There are no default properties. Each type of Processor must define which Properties make sense for its use case.
Here we have to configure properties for the GetFile and PutHDFS Processors.
GetFile Processor Properties:
Creates FlowFiles from files in a directory. NiFi will ignore files it doesn’t have at least read permissions for.
|Name|Default Value|Allowable Values|Description|
|---|---|---|---|
|Input Directory| | |The input directory from which to pull files. Supports Expression Language: true|
|File Filter|[^\.].*| |Only files whose names match the given regular expression will be picked up|
|Path Filter| | |When Recurse Subdirectories is true, then only subdirectories whose path matches the given regular expression will be scanned|
|Batch Size|10| |The maximum number of files to pull in each iteration|
|Keep Source File|false|true, false|If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If not keeping the original, NiFi will need write permissions on the directory it is pulling from; otherwise it will ignore the file.|
|Recurse Subdirectories|true|true, false|Indicates whether or not to pull files from subdirectories|
|Polling Interval|0 sec| |Indicates how long to wait before performing a directory listing|
|Ignore Hidden Files|true|true, false|Indicates whether or not hidden files should be ignored|
|Minimum File Age|0 sec| |The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored|
|Maximum File Age| | |The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored|
|Minimum File Size|0 B| |The minimum size that a file must be in order to be pulled|
|Maximum File Size| | |The maximum size that a file can be in order to be pulled|
Here we set Input Directory to the location on the local disk where the file is stored (/home/hduser/details_console). The other properties can be overridden as required; here I leave them at their default values.
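To get a feel for how these properties narrow down which files are picked up, here is a rough Python sketch of the selection rules. It is not GetFile's actual implementation; the function `select_files` and its parameter names are hypothetical, and only File Filter, Batch Size, Minimum File Age, and the file size limits are modelled. The sample file names are made up.

```python
import os
import re
import tempfile
import time


def select_files(directory, file_filter=r"[^\.].*", batch_size=10,
                 min_age_sec=0.0, min_size=0, max_size=None):
    """Return up to batch_size file names that pass the configured filters."""
    pattern = re.compile(file_filter)
    now = time.time()
    picked = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # Subdirectories are not scanned in this sketch.
        if not pattern.fullmatch(name):
            continue  # File Filter: the whole name must match the regex.
        if not os.access(path, os.R_OK):
            continue  # Files we cannot read are ignored.
        info = os.stat(path)
        if min_age_sec > 0 and now - info.st_mtime < min_age_sec:
            continue  # Minimum File Age, based on last modification time.
        if info.st_size < min_size:
            continue  # Minimum File Size.
        if max_size is not None and info.st_size > max_size:
            continue  # Maximum File Size.
        picked.append(name)
        if len(picked) == batch_size:
            break  # Batch Size: stop once the batch is full.
    return picked


with tempfile.TemporaryDirectory() as input_dir:
    samples = [("a.txt", b"12345"), (".hidden", b"x"), ("big.bin", b"0" * 100)]
    for name, body in samples:
        with open(os.path.join(input_dir, name), "wb") as handle:
            handle.write(body)
    default_pick = select_files(input_dir)
    small_only = select_files(input_dir, max_size=10)
    one_per_batch = select_files(input_dir, batch_size=1)
```

With the default File Filter `[^\.].*`, a name starting with a dot never matches, which is why `.hidden` is skipped even before any hidden-file setting is considered.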
PutHDFS Processor Properties:
|Name|Default Value|Allowable Values|Description|
|---|---|---|---|
|Hadoop Configuration Resources| | |A file or comma-separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a ‘core-site.xml’ and ‘hdfs-site.xml’ file or will revert to a default configuration.|
|Kerberos Principal| | |Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties|
|Kerberos Keytab| | |Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties|
|Kerberos Relogin Period| | |Period of time which should pass before attempting a Kerberos relogin|
|Additional Classpath Resources| | |A comma-separated list of paths to files and/or directories that will be added to the classpath. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.|
|Directory| | |The parent HDFS directory to which files should be written. Supports Expression Language: true|
|Conflict Resolution Strategy|fail|replace, ignore, fail|Indicates what should happen when a file with the same name already exists in the output directory|
|Block Size| | |Size of each block as written to HDFS. This overrides the Hadoop Configuration|
|IO Buffer Size| | |Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration|
|Replication| | |Number of times that HDFS will replicate each file. This overrides the Hadoop Configuration|
|Permissions umask| | |A umask represented as an octal number which determines the permissions of files written to HDFS. This overrides the Hadoop Configuration dfs.umaskmode|
|Remote Owner| | |Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change owner|
|Remote Group| | |Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change group|
| | | |No Description Provided.|
Here we set Hadoop Configuration Resources to the Hadoop configuration files (/home/hduser/hadoop-2.7.1/etc/hadoop/core-site.xml and hdfs-site.xml) and Directory to the HDFS destination directory where the file will be stored (nifi/nifi_put_hdfs_ex). The other properties can be overridden as required; here I leave them at their default values.
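The Permissions umask property takes an octal number, which can be confusing at first. The small Python sketch below shows the arithmetic, assuming the usual convention that files start from mode 666 and directories from 777 before the umask is applied; the value `022` is only an example, not something configured in this walkthrough.

```python
def apply_umask(base_mode, umask):
    """Clear every permission bit that is set in the umask."""
    return base_mode & ~umask


FILE_BASE = 0o666  # rw-rw-rw-, assumed starting point for files
DIR_BASE = 0o777   # rwxrwxrwx, assumed starting point for directories

# The property is entered as text, so parse it as a base-8 number.
umask = int("022", 8)

file_mode = apply_umask(FILE_BASE, umask)
dir_mode = apply_umask(DIR_BASE, umask)
print(oct(file_mode), oct(dir_mode))
```

Under these assumptions a umask of 022 gives files mode 644 (owner read/write, everyone else read-only) and directories mode 755.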
Each of the Relationships that is defined by the Processor is listed here, along with its description. In order for a Processor to be considered valid and able to run, each Relationship defined by the Processor must be either connected to a downstream component or auto-terminated.
If a Relationship is auto-terminated, any FlowFile that is routed to that Relationship will be removed from the flow and its processing considered complete. Any Relationship that is already connected to a downstream component cannot be auto-terminated.
Here we checked both the success and failure checkboxes, so that FlowFiles are auto-terminated whether processing succeeds or fails.
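The validity rule for Relationships can be written down as a small check. The Python sketch below is a hypothetical helper, not NiFi code. It assumes the wiring used in this walkthrough: GetFile's success relationship feeds PutHDFS, and PutHDFS's success and failure relationships are auto-terminated.

```python
def validate_relationships(defined, connected, auto_terminated):
    """Return a list of problems; an empty list means the processor is valid."""
    errors = []
    for name in sorted(defined):
        if name in connected and name in auto_terminated:
            errors.append(f"'{name}' is connected and cannot be auto-terminated")
        elif name not in connected and name not in auto_terminated:
            errors.append(f"'{name}' is neither connected nor auto-terminated")
    return errors


# GetFile: 'success' is connected to PutHDFS.
getfile_errors = validate_relationships({"success"}, {"success"}, set())

# PutHDFS: both relationships are auto-terminated.
puthdfs_errors = validate_relationships(
    {"success", "failure"}, set(), {"success", "failure"})

# A processor where 'failure' was forgotten.
unfinished = validate_relationships({"success", "failure"}, set(), {"success"})
print(unfinished)
```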
The last tab in the Processor configuration dialog is the Comments tab. This tab simply provides an area for users to include whatever comments are appropriate for this component.
Once processors and other components have been added to the canvas and configured, the next step is to connect them to one another so that NiFi knows what to do with each FlowFile after it has been processed. This is accomplished by creating a Connection between each component. When the user hovers the mouse over the center of a component, a new Connection icon appears.
The user drags the Connection bubble from one component to another until the second component is highlighted. When the user releases the mouse, a Create Connection dialog appears. This dialog consists of two tabs: ‘Details’ and ‘Settings’.
Here we left the Create Connection dialog at its default Details and Settings values.
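Conceptually, a Connection is a queue that sits between two components. The Python sketch below imitates the flow we are building with two hypothetical functions and a `deque` as the Connection. It is only a local simulation: `put_directory` writes to an ordinary directory rather than to HDFS, and the file name and content are made up.

```python
import os
import tempfile
from collections import deque


def get_file(input_dir, queue, keep_source=False):
    """Toy GetFile: read each file, enqueue it, then remove the source."""
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            queue.append(({"filename": name}, handle.read()))
        if not keep_source:
            os.remove(path)  # Mirrors Keep Source File = false.


def put_directory(output_dir, queue):
    """Toy stand-in for PutHDFS: drains the queue into a local directory."""
    while queue:
        attributes, content = queue.popleft()
        target_path = os.path.join(output_dir, attributes["filename"])
        with open(target_path, "wb") as handle:
            handle.write(content)


with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as target:
    with open(os.path.join(source, "details.txt"), "wb") as handle:
        handle.write(b"id,name\n1,alice\n")
    connection = deque()  # The Connection between the two components.
    get_file(source, connection)
    queued = len(connection)
    put_directory(target, connection)
    left_in_source = os.listdir(source)
    written = os.listdir(target)
```

Because the source file is removed after it is queued, the input directory ends up empty once the flow has run, which matches the default Keep Source File setting.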
Before trying to start a Processor, it’s important to make sure that the Processor’s configuration is valid. A status indicator is shown in the top-left of the Processor. If the Processor is invalid, the indicator will show a red Warning indicator with an exclamation mark indicating that there is a problem.
In this case, hovering over the indicator icon with the mouse will provide a tooltip showing all of the validation errors for the Processor. Once all of the validation errors have been addressed, the status indicator will change to a Stop icon, indicating that the Processor is valid and ready to be started but currently is not running.
Command and Control of the DataFlow
When a component is added to the NiFi canvas, it is in the Stopped state. In order to cause the component to be triggered, the component must be started. Once started, the component can be stopped at any time. From a Stopped state, the component can be configured, started, or disabled.
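These state changes can be pictured as a small state machine. The Python sketch below is our own simplification, not NiFi's internal model: the state names and the `enable` transition from Disabled back to Stopped are assumptions added to round out the picture, and configuring is left out because it does not change the state.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("STOPPED", "start"): "RUNNING",
    ("RUNNING", "stop"): "STOPPED",
    ("STOPPED", "disable"): "DISABLED",
    ("DISABLED", "enable"): "STOPPED",  # Assumed reverse of 'disable'.
}


def apply_action(state, action):
    """Return the next state, or raise ValueError for a disallowed action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a component that is {state}") from None


state = "STOPPED"  # A newly added component starts out stopped.
history = [state]
for action in ["start", "stop", "disable", "enable"]:
    state = apply_action(state, action)
    history.append(state)
print(history)
```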
Starting a Component
In order to start a component, the following conditions must be met:
The component’s configuration must be valid.
All defined Relationships for the component must be connected to another component or auto-terminated.
The component must be stopped.
The component must be enabled.
The component must have no active tasks.
Components can be started by selecting all of the components to start and then clicking the Start icon in the Actions Toolbar or by right-clicking a single component and choosing Start from the context menu.
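The start conditions listed above read naturally as a checklist. Here is a minimal Python sketch of such a check; `ComponentStatus` and `unmet_start_conditions` are hypothetical names for this example and do not correspond to anything in NiFi itself.

```python
from dataclasses import dataclass


@dataclass
class ComponentStatus:
    """Hypothetical snapshot of the facts the start conditions depend on."""
    config_valid: bool
    relationships_resolved: bool
    stopped: bool
    enabled: bool
    active_tasks: int


def unmet_start_conditions(status):
    """Return the reasons a component cannot start; empty means it can."""
    checks = [
        (status.config_valid, "configuration is invalid"),
        (status.relationships_resolved,
         "a relationship is neither connected nor auto-terminated"),
        (status.stopped, "component is not stopped"),
        (status.enabled, "component is disabled"),
        (status.active_tasks == 0, "component still has active tasks"),
    ]
    return [reason for ok, reason in checks if not ok]


ready = ComponentStatus(True, True, True, True, 0)
draining = ComponentStatus(True, True, True, True, 2)
print(unmet_start_conditions(ready))
print(unmet_start_conditions(draining))
```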
Starting GetFile Processor:
Starting PutHDFS Processor:
Input file on the local Ubuntu file system:
HDFS state before starting the dataflow:
Local file directory on Ubuntu after starting the dataflow:
HDFS state after starting the dataflow:
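If you prefer to check the HDFS state from a script rather than by eye, one option is to call the `hdfs dfs -ls` command from Python. This is a sketch that assumes the `hdfs` command-line client is installed and on the PATH of the machine running it; the helper names are hypothetical, and the directory is the one configured for PutHDFS above.

```python
import subprocess


def hdfs_ls_command(directory):
    """Build the argument list for listing an HDFS directory."""
    return ["hdfs", "dfs", "-ls", directory]


def list_hdfs_directory(directory):
    """Run the listing and return its text output (needs a reachable cluster)."""
    completed = subprocess.run(
        hdfs_ls_command(directory),
        capture_output=True,
        text=True,
        check=True,  # Raise CalledProcessError if the command fails.
    )
    return completed.stdout


command = hdfs_ls_command("nifi/nifi_put_hdfs_ex")
print(" ".join(command))

# With a running cluster you could then call:
# print(list_hdfs_directory("nifi/nifi_put_hdfs_ex"))
```

Running the listing once before starting the processors and once after gives the same before/after comparison as above.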
Stopping a Component
A component can be stopped any time that it is running. A component is stopped by right-clicking on the component and clicking Stop from the context menu, or by selecting the component and clicking the Stop icon in the Actions Toolbar.