Machine Learning : Support Vector Machine  Algorithm 

 

What is SVM ?

It's a classification technique, which means it is used to split the data into classes. It helps to split the data in the best possible way, i.e. it provides the best split.

How is SVM going to do that?

It tries to find the widest margin between the two groups of data. It does not consider every point in the graph; it considers only the points near the separation line ( these points are called support vectors, and the separation line is called a hyperplane ).

The way it calculates this is that it always tries to find the hyperplane which is as far as possible from the support vectors on both sides.

Suppose you are trying to draw a separation line between India and Pakistan using SVM; it ensures that the line is equally far from both countries.

In this case, the support vectors on the Indian side are places like Kashmir and Punjab, and the places in Pakistan which are close to the separation line are the support vectors on the Pakistani side.

Places like AP, Karnataka and Tamil Nadu in India cannot be considered support vectors, as they are far from the separating line.

Can I always get an accurate line which separates the data using SVM?

Not always; it depends on the data and other factors.

Are there any ways I can tune the model to get a better split?

We can tune the model using the following parameters. These parameters can help in reducing overfitting and getting a better fit.

1) Kernel:
We can make use of kernels when the data is not linearly separable. For example, if your data forms concentric circles, you cannot separate it with a hyperplane; in such cases you can take the help of kernels, which transform the data into a linearly separable form ( see the sketch after this list ). The kernels available include linear, poly, rbf and sigmoid.

2) Gamma ( if you want to consider farther points like AP and Karnataka, you can play with this parameter ):
Low gamma: far points are also considered for the decision boundary.
High gamma: only points close to the boundary matter, giving curvier lines that fit the training data more closely ( and may overfit ).

3) C Parameter:
Controls the tradeoff between a smooth decision boundary and classifying the training points exactly.
A bigger C gives curvier lines that classify more training points correctly, at the risk of overfitting.
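
As a quick illustration of kernels and these parameters, here is a minimal scikit-learn sketch; the make_circles dataset and the specific C/gamma values are assumptions chosen just for the example.

from sklearn import svm
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Concentric circles: not linearly separable
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear kernel struggles on this data
linear_model = svm.SVC(kernel='linear').fit(x_train, y_train)
print('linear kernel accuracy:', linear_model.score(x_test, y_test))

# The RBF kernel maps the data into a space where it becomes separable;
# C and gamma control how tightly the boundary fits the training points
rbf_model = svm.SVC(kernel='rbf', C=1.0, gamma='scale').fit(x_train, y_train)
print('rbf kernel accuracy:', rbf_model.score(x_test, y_test))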

Advantages:
Performs well when there is a clear margin of separation between classes.
Works well on small data sets with a high number of dimensions.

Disadvantages:

When there is a lot of overlap between the classes, SVM is not recommended; in those cases Naive Bayes is better.

Doesn’t perform well with larger data sets

Python Code:

from sklearn import svm
model = svm.SVC(kernel='linear')   # kernel can be modified here; sklearn's SVC defaults to 'rbf'
model.fit(x_train, y_train)

Using the C parameter:
Examples:
model = svm.SVC(kernel='linear', C=0.1)   # small C favours a smoother boundary; default is 1.0
model = svm.SVC(kernel='linear', C=100)   # large C tries to classify every training point correctly

To conclude, in layman's terms, we can say that SVM prefers a thick gap to a thin gap between the two groups of data.

Architecture of one of my cloud solutions using microservices with AWS components

Components Leveraged :

Jenkins/GitHub/EC2

Source code is versioned with GitHub and built with a Jenkins server running on an AWS EC2 instance, providing continuous integration on every commit to the GitHub repository via a webhook.

Domain Registration:
Route 53 is used for public domain registration.

Amazon ECS:
Services are deployed as Docker containers in ECS clusters by creating task definitions with the required configuration.

API Gateway:
Gateway for accessing the microservices deployed within ECS.

Elastic Load Balancing:
Helps in balancing the incoming traffic to the system.

Amazon CloudFront:
The web application is served through a CloudFront distribution for faster retrieval of images. CloudFront makes use of edge locations for better performance.

AWS Lambda:
Integrated with CloudWatch. Used for invalidating the CloudFront cache when a new image is stored in the S3 bucket.
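
A minimal sketch of such a Lambda handler is shown below, assuming the function is wired to S3 object-created events and that the CloudFront distribution ID is passed in through an environment variable ( the variable name and the path scheme are assumptions, not the actual implementation ).

import os
import time
import boto3

cloudfront = boto3.client('cloudfront')

def handler(event, context):
    # Each record corresponds to an object created in the S3 bucket
    paths = ['/' + record['s3']['object']['key'] for record in event['Records']]
    cloudfront.create_invalidation(
        DistributionId=os.environ['DISTRIBUTION_ID'],   # hypothetical environment variable
        InvalidationBatch={
            'Paths': {'Quantity': len(paths), 'Items': paths},
            'CallerReference': str(time.time())         # must be unique per invalidation request
        }
    )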

SNS:
Cloud-based pub/sub messaging service used for sending notifications and alarms to management when the system is unhealthy.

IAM:
Used for authorized access to various AWS services and resources.

Cloudwatch:
Used for monitoring various cloud services deployed as part of the system.

Lex and Polly:
Used for the chatbot services of the application. Lex provides the interface with social channels like Facebook. Polly is used for the text-to-speech service in the application.

S3:
Frequently accessed images are stored in S3, while older images are moved to Glacier for audit purposes.

How to design and implement a web-scale cloud content storage and delivery infrastructure on AWS using S3

Architecture for AWS Storage and Delivery infrastructure :

The proposed solution makes use of different AWS services to arrive at a scalable, secure and cost-optimized solution.

Two S3 buckets are used for the complete solution: a primary bucket, which acts as the primary site, and a secondary bucket, which acts as the disaster-recovery site.

1) Disaster Recovery: To make the best use of the disaster site, it is always recommended to create it in a completely different region in a different geographical area. We therefore created the two buckets in two different regions, one in North California and the other in EU (London). The S3 Cross-Region Replication feature helps meet this requirement by replicating data from the bucket in one region to the bucket in the other region.
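
A minimal sketch of enabling this replication with boto3 is shown below; the bucket names and the IAM role ARN are placeholders, and versioning must already be enabled on both buckets ( the first two calls take care of that ).

import boto3

# Clients scoped to each bucket's region ( region choices and bucket names are placeholders )
s3_primary = boto3.client('s3', region_name='us-west-1')     # North California
s3_secondary = boto3.client('s3', region_name='eu-west-2')   # EU (London)

# Cross-region replication requires versioning on both buckets
s3_primary.put_bucket_versioning(Bucket='primary-bucket',
                                 VersioningConfiguration={'Status': 'Enabled'})
s3_secondary.put_bucket_versioning(Bucket='secondary-bucket',
                                   VersioningConfiguration={'Status': 'Enabled'})

s3_primary.put_bucket_replication(
    Bucket='primary-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',  # placeholder IAM role
        'Rules': [{
            'ID': 'replicate-to-dr-site',
            'Prefix': '',                                   # replicate every object
            'Status': 'Enabled',
            'Destination': {'Bucket': 'arn:aws:s3:::secondary-bucket'}
        }]
    }
)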

2) Cost optimization is the big benefit we can provide to the customer by making the best use of the lifecycle policy defined on a bucket. We can move objects at configured time intervals to different storage classes like S3 Standard-IA and Glacier so that cost is reduced. Standard-IA can be used for infrequently accessed data which, when required, is still available instantly. Glacier can be used for archive data needed for audits or year-end activities. The policy we defined on the buckets for the current requirement is: move objects from S3 Standard to Standard-IA 75 days after creation, keep them there until a year from the date of creation has passed, then keep the data in Glacier for one more year ( usable for audits ), and finally purge it.
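
Expressed with boto3, that lifecycle policy would look roughly like the sketch below; the bucket name is a placeholder, and the day counts simply mirror the description above.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='primary-bucket',                      # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'tiering-and-purge',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},             # apply to every object
            'Transitions': [
                {'Days': 75,  'StorageClass': 'STANDARD_IA'},   # infrequent access after 75 days
                {'Days': 365, 'StorageClass': 'GLACIER'}        # archive after one year
            ],
            'Expiration': {'Days': 730}           # purge after the second year
        }]
    }
)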

3) Authentication and Authorization is the other big thing we can provide to the customer to make the data more secure; AWS provides security, identity and compliance services for the same. We can create users/groups with IAM and associate them with the required bucket. Here is the snapshot of the security policy for one of the buckets; the same can be applied to any bucket. You can configure this along with the S3 configuration after creating the users in IAM, or you can go directly to IAM and do the configuration there by attaching the required bucket.
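
As an illustration, a bucket policy granting a specific IAM user read access could be applied as below; the user ARN, bucket name and choice of actions are assumptions, not the actual policy from the snapshot.

import json
import boto3

s3 = boto3.client('s3')

policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'AllowReadForAppUser',
        'Effect': 'Allow',
        'Principal': {'AWS': 'arn:aws:iam::123456789012:user/app-user'},  # placeholder user
        'Action': ['s3:GetObject'],
        'Resource': 'arn:aws:s3:::primary-bucket/*'                       # placeholder bucket
    }]
}

s3.put_bucket_policy(Bucket='primary-bucket', Policy=json.dumps(policy))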

4) Geo-restrictions are another feature we can offer through AWS; they help the customer enforce restrictions for a particular country. There might be use cases where the customer wants to enforce this because of government policies. The following snapshot shows the geo-restriction we applied for the country Cuba.
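
In CloudFront this sits in the distribution configuration; a rough boto3 sketch of blacklisting Cuba on an existing distribution follows ( the distribution ID is a placeholder and error handling is omitted ).

import boto3

cloudfront = boto3.client('cloudfront')

dist_id = 'E1234567890ABC'                       # placeholder distribution ID
response = cloudfront.get_distribution_config(Id=dist_id)
config = response['DistributionConfig']

# Blacklist Cuba (ISO country code CU) at the distribution level
config['Restrictions'] = {
    'GeoRestriction': {
        'RestrictionType': 'blacklist',
        'Quantity': 1,
        'Items': ['CU']
    }
}

cloudfront.update_distribution(Id=dist_id,
                               DistributionConfig=config,
                               IfMatch=response['ETag'])   # ETag from the get call is required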

5) Reduced Latency: This is one big thing that any customer looks at. AWS provides a CDN offering called CloudFront for achieving this. CloudFront can be configured to serve the content of the S3 bucket so that data is served quickly across the globe through the edge locations deployed globally, and the multiple edge locations also improve reliability. The following snapshot shows the CloudFront distribution and its configuration.

Cloud deployment models and services offered: my views

What are the different deployment models ?
At a high level, following are the different deployment models:

Public Cloud: Consumers of the public cloud share the resources of the vendor's public data centers. ( Amazon AWS is one big player in this area. )
Private Cloud: Resources are dedicated to the given customer and can be hosted on the vendor's private network or at the customer's own data centers.
Hybrid Cloud: A combination of on-premise infrastructure and cloud.

Once deployed, what's next? ( Cloud Services )
Once we decide on the deployment model, the next step is consuming cloud services, and we need to decide what services we are interested in.

The following quick set of questions can help us understand the need for the different types of services:

1) Are you interested in getting just a virtual machine from the vendor, on which you install your own OS and software and build your own products with your own technology stack?

IAAS ( Infrastructure as a Service ) is the solution. Amazon EC2 is one example of IAAS: a bare virtual machine on which you can install any OS of your choice and deploy your own stack to build your own product, without worrying about infrastructure availability, scalability, reliability, etc. The entire infrastructure is taken care of by the IAAS provider. This service is helpful if you want to port your own legacy products to the cloud, as you will be porting your complete stack. To add, IAAS starts from the virtualization layer in the world of cloud computing.

2) Are you interested in getting basic infrastructure along with some middleware services and building your own product on top of these provided services?

PAAS ( Platform as a Service ) is the solution. The advantage you get with this kind of service is that you can make use of ready-made middleware services and build your own product, focusing on your business idea. Google Firebase Storage and Amazon DynamoDB are some PAAS services. I worked on a product using Google Firebase Storage and Database; this let me focus only on inserting/reading/updating data, without worrying about performance, software upgrades, availability, etc. To summarize, PAAS starts one layer above virtualization, with entities like the OS and middleware services that can be used readily.

3) Do you not want the headache of infrastructure, a technology stack or building a product, and just want to use a product and pay for what you use?

SAAS ( Software as a Service ) is the solution. The biggest advantage you get here is that you just use the product without any hassles of maintenance, upgrades, security, scalability, reliability, etc. Companies like Oracle, Salesforce and Google provide a number of SAAS services. For example, if you want a mail service for your company, don't build anything from scratch; just use Google's SAAS offerings for the same. Similarly, if you want identity management for your organization, Oracle provides SAAS offerings like IDCS. SAP has a suite of SAAS offerings ( e.g. SAP HANA Finance ) for running your organization smoothly.

Conclusion:
At a high level, to summarize, every cloud delivery model has its own advantages; however, the choice of delivery model depends upon the kind of requirements the organization has. Say you are a startup that wants to work on building its idea: it doesn't make sense to focus on every basic, including infrastructure, so PAAS is a good fit for you. Say you are a well-established company that wants to make a seamless transition to the cloud with your existing software solution: IAAS can be the best fit for you. Finally, coming to SAAS, it can be the best fit for any user who wants to use or integrate existing solutions.

From my personal experience as a developer building a product around my own idea, I prefer using PAAS ( I used Google Cloud features like Firebase ) as it lets me focus more on my business idea.

Summary of cloud computing evolution and my views on why it has become such a relevant topic in today’s IT industry?

Evolution of cloud computing ?

The concept of cloud computing started way back with mainframes, around the 1950s and 1960s, when terminals were used to access the computing power of mainframes. The reason for using such terminals was the cost, maintenance and other factors involved with mainframes. Also, you could not afford to have a mainframe wherever required; instead, its computing power was accessed through dumb terminals. That's how the journey of remote computing started.

The journey then shifted from remote computing to distributed plus remote computing with the advent of virtualization. One key benefit of virtualization is that it lets you host multiple virtual hosts on a single physical machine (or across multiple machines), which in turn helps distribute the computing power.
Virtualization acts as the foundation of cloud computing.

Therefore I feel cloud computing is the capability of using distributed and remote computing, which in turn leads to numerous benefits.

Following are some of the benefits I see with cloud computing:

Elasticity – Helps us grow and shrink resources as required.
Recovery – No need to worry about disasters, as disaster recovery is handled by the cloud vendors.
Security – Most cloud vendors adhere to standard security practices, so it's not necessary for me to worry about security breaches.
Cost-effective – I pay more only if I use more resources, and using more resources is a sign of increased revenue; hence I feel this model suits a startup starting with minimal capital.
Accessibility – As the services are hosted in the cloud, I don't need to worry about access issues; I can access them from anywhere as long as I meet the required security standards.
Self-service – Most cloud vendors provide self-service capabilities; it doesn't require me to reach anyone in person, and most services are available at the click of a mouse.
Ease of use – It is easy to use and integrate most cloud services with just basic configuration.

Conclusion:
From the above features I definitely feel that cloud computing will have a great impact on all current products, along with next-generation technologies like IoT, virtual reality and AI, which will occupy every aspect of our lives. I feel a day will come when we will have a monthly bill for cloud computing alongside our utility bills like the water bill, gas bill, etc. :)

Log Management in Kubernetes

Centralized logging

When we troubleshoot a problem or trace a security or performance issue, logs are the best source of information. In a distributed environment, centralized logging helps accumulate this data and provides benefits like proactively managing your network, reducing bug-analysis time, improving security, reducing the risk of losing data, and providing aggregated performance statistics.

Business context

Centralized logging is especially relevant in the Kubernetes ecosystem. While building any development environment around Kubernetes, it is highly recommended to have centralized monitoring of all the pods and nodes present in the Kubernetes ecosystem. As we are aware, Kubernetes is a container management system, or in other words an orchestration system, which orchestrates all the pods in the cluster. Having said that, there might be many worker/master nodes running in the cluster with different workloads. In this context, if something goes wrong in the overall ecosystem, or in a particular worker node or pod, it becomes challenging for the developer to root-cause the issue, as the logs are scattered across different containers and nodes and there is no centralized way of accessing them.

With these problems in mind, we want a solution that lets developers obtain logs in a simpler way, through a single entry point, with more visual information for better analysis. Achieving this can definitely improve developer productivity and the turnaround time for fixing issues, which in turn helps provide high availability of the system.

How can I achieve centralized logging?

In today's distributed computing, a new set of solutions has been designed for high-volume, high-throughput log and event collection. Most of these are event streaming and processing systems, and logging is just one use case they can solve. They all have their specific features and differences, but their architectures are almost the same. They generally consist of logging clients and/or agents on each host. The agents (log shippers) forward logs to a cluster of collectors, which in turn forward the messages to a scalable storage tier ( e.g. Elasticsearch ). The idea is that the collection tier scales horizontally with the growing number of logging hosts and messages, and the storage tier likewise scales horizontally with increased volume.

Logging Solutions

Today, in the cloud-native environment, there are different ways of accessing, processing and publishing logs to provide better insight into the health and state of the system. Scribe, Flume, Logstash, Kafka, Splunk and Fluentd are some of the log shippers available on the market. In the context of Kubernetes deployments, shippers like Logstash, Fluentd and Beats are frequently used for log management. Following are the common stacks:

ELK ( ElasticSearch and Kibana with Logstash )

EFK ( ElasticSearch and Kibana with Fluentd )

EBK ( ElasticSearch and Kibana with Beats )

Among the above three stacks, most deployments make use of Fluentd or Beats as the log shipper along with Elasticsearch and Kibana, the reason being that these two log shippers are lightweight compared to Logstash ( Logstash consumes more resources; its default heap size is 1 GB. Though Logstash performance has improved in the latest versions, it is still slower than its alternatives. In a typical scenario Logstash takes around 120 MB compared to Fluentd's 40 MB ).

Following are some other differences between Logstash and Fluentd ( Reference: https://logz.io/blog/fluentd-logstash/ ).

When it comes to the choice between Fluentd and Beats, following are some inputs:

Fluentd can be used for shipping the logs. However, some of its downsides are: its plugins are maintained by different individuals across the globe, so there is a potential risk of version incompatibilities; at its core Fluentd lacks the concept of filters for massaging the data; and it does not provide any ready-to-go dashboards for visualizing the data. Beats are also lighter weight than Fluentd.

On the other hand, Beats are also used for shipping logs and are lightweight, extensible, ready-to-run agents with better visibility into containers. As they are developed by the dedicated Elastic community, there are no version-compatibility issues. That's not the only reason to choose Beats: they come in different flavours for different requirements, and each flavour has a different set of modules to handle different use cases. As a developer you just pick the proper Beat and configure the required modules. Once the proper Beats and modules are configured, they come with default dashboards which are ready to use in Kibana.

Different Beats offered by Elastic :

Filebeat – Used to monitor files, can deal with multiple files in one directory, and has modules for files in well-known formats like Nginx, Apache HTTP Server, etc.

Metricbeat – Monitors the resources of our hardware: think used memory, used CPU, disk space, etc.

Heartbeat – Used to check endpoints for availability. Can check the time it took to connect and whether the remote system is up.

Packetbeat – Takes a deep dive into the packets going over the wire. It can sniff a number of protocols like HTTP, DNS and AMQP.

The most frequently used Beats from the above family are Filebeat and Metricbeat.

Filebeat – A tiny library with no dependencies. It takes very little resources and has lots of knobs to tweak. Used for monitoring files and can deal with multiple files in one directory. In the latest versions it can send data to Kafka and Redis to support heavy loads.

Apart from the above, Filebeat provides other modules like Icinga, Kafka, Logstash, MySQL, Nginx, Osquery, PostgreSQL, Redis and Traefik for capturing the respective logs.
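
As a rough sketch of how Filebeat is pointed at container logs ( the log path and the Elasticsearch host below are placeholders, not taken from an actual deployment ), a minimal filebeat.yml looks something like this:

filebeat.inputs:
  - type: log
    paths:
      - /var/log/containers/*.log    # container log files on the Kubernetes node

output.elasticsearch:
  hosts: ["elasticsearch:9200"]      # placeholder Elasticsearch endpoint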

Metricbeat – It's also a tiny library with no dependencies, used for collecting metric information, and it has different modules like Docker, Kubernetes, System, Kafka, etc.

Apart from the above, Metricbeat provides other metric modules like Aerospike, Apache, Ceph, Couchbase, Docker, Dropwizard, Elasticsearch, Etcd, Golang, Graphite, HAProxy, HTTP, Jolokia, Kafka, Kibana, Kubernetes, Logstash, Memcached, MongoDB, MySQL, Nginx, PHP_FPM, PostgreSQL, Prometheus, RabbitMQ, Redis, System, uWSGI, vSphere, Windows and ZooKeeper.

Elasticsearch Datasource in Grafana

What is Grafana?
Grafana is an open-source visualization tool which can be used with different time-series databases.

What is Elasticsearch?
Elasticsearch is a NoSQL database, or rather a search engine, based on Lucene.

Why Elasticsearch with Grafana?

In my current dev cluster, Elasticsearch is used as the repository for storing logs, and Kibana is used for visualizing the logs and monitoring Elasticsearch. However, Kibana doesn't provide a rich graphical interface when it comes to metrics. Therefore the focus here is to check whether Grafana's monitoring capabilities can be leveraged with the data ( in this context, the dev cluster logs ) present in Elasticsearch.

Why Grafana?
Grafana is already part of our infrastructure; with its rich graphical dashboards, it is our choice for monitoring the components present in the dev cluster.

How does Grafana interact with Elasticsearch?

There are two ways that Grafana can interact with Elasticsearch:

1) Prometheus Datasource: [ Elasticsearch -> Prometheus Exporter -> Prometheus -> Grafana ]

Using Prometheus exporters, Elasticsearch data can be scraped by Prometheus, and then the Prometheus datasource can be used by Grafana dashboards.

2) Elasticsearch Datasource: [ Elasticsearch -> Grafana ]

Elasticsearch can be attached directly as a datasource for Grafana dashboards.

POC: ( Elasticsearch Datasource )

This POC explores the capabilities of Grafana with Elasticsearch as the datasource, using some sample data. Grafana provides different configuration options for different datasources, as per the datasource's capabilities.

In the case of the Elasticsearch datasource, Grafana provides three important fields:

Query – Can be any Lucene query.
Metric – Aggregate functions.
Group By – For grouping.

Multiple queries and metrics can be defined.
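
For example, a panel that counts error log entries from the dev cluster could be configured roughly as follows ( the field names are assumptions about the log documents, not taken from an actual index ):

Query – kubernetes.namespace:"dev" AND level:"ERROR"
Metric – Count
Group By – Date Histogram ( @timestamp ), interval 1m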

Configuring Datasource: ( Attached Screenshot for reference )

1) Add Datasource:
Settings -> Data Sources ( select Elasticsearch )
2) Mandatory Configuration Parameters:
Elasticsearch server URL, Access Mode, Index Pattern, Time Field Name

With the above two steps, Elasticsearch is ready to be used as a datasource in Grafana.

Configuring Dashboard: ( Attached Screenshot for reference )

Once the datasource is created, the next step is to create a dashboard and attach the datasource created in the above steps.

1) Select a new dashboard.
2) Select the required panel by going to Dashboards ( e.g. bar graph, pie chart, etc. ).
3) Click on the panel header and choose Edit.
4) Select the Metrics tab and then select the datasource you are interested in ( the datasource created in the above step ).
5) Finally, select the required metrics and Group By attributes.
6) The updated dashboard is displayed.

Conclusion
To make use of Elasticsearch as a datasource with Grafana, we need to figure out the Elasticsearch index pattern we are interested in. Following that, we need to identify the required fields/attributes in those indices and configure the dashboard with the required queries/metrics on the identified fields.

Elasticsearch X-Pack

What is X-Pack?
X-Pack, sometimes referred to as extended Elasticsearch, is a plugin that can be installed with the Elasticsearch server.

What is it all about?
Before Elasticsearch 5.0.0, several features were shipped as individual plugins such as Watcher, Marvel, Shield, etc. With X-Pack in place, all of these features are brought together under one stack, which is referred to as X-Pack.

Installation reference:

https://www.elastic.co/guide/en/x-pack/current/installing-xpack.html

What are the different components that I get with X-Pack?

Following are the components that we get with X-Pack. These components sit in Elasticsearch, Kibana or Logstash, as per the feature requirement.

Security: ( formerly referred to as Shield )
Provides authentication ( integrating with Active Directory, LDAP ), authorization ( role-based access for cluster and index operations ) and auditing, along with other features like IP filtering, encryption, etc.
A few of them can be managed using the REST API, e.g.:

PUT /_xpack/security/user/<username> ( create a user )
PUT /_xpack/security/role/<rolename> ( create a role )

Reporting:
For sharing information about the data ( on-demand, scheduled and event-based reports can be generated ).
Combined with the alerting features, reports can be shared on a daily, weekly, etc. basis.

Alerting: ( Watcher )
Used for setting up alerts when something changes in the system. This is completely API driven; it can be thought of as a kind of trigger that fires based on a particular condition.

E.g.:
PUT _xpack/watcher/watch/<watch_id> – this API takes a body with four parts:

trigger: how frequently you want the watch to run.
input: can be a DSL query or any HTTP request.
condition: the condition under which the actions should be executed.
actions: the actions to be performed.
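
A small illustrative watch body following the structure above ( the index pattern, schedule and logging text are assumptions for the example ) could look like:

PUT _xpack/watcher/watch/log_error_watch
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "logs-*" ],
        "body": { "query": { "match": { "level": "ERROR" } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "log_error": { "logging": { "text": "Errors found in the logs indices" } }
  }
}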

Graphing:
Helps in visualizing the relationships within the data ( e.g. an index ) present in Elasticsearch.
Provides a REST API for retrieving the graph data.

SQL: ( read-only queries )
Helps users with SQL expertise interact with Elasticsearch in an SQL way. It provides an interface for using SQL with Elasticsearch and returns data in a tabular structure. For example, we can fetch data with a simple SQL query such as:

SELECT last_name, first_name FROM emp ORDER BY emp_no ( where emp is an index )

Monitoring: ( formerly referred to as Marvel )
Helps in monitoring the cluster with the help of Kibana dashboards. Agents sit on the nodes and keep shipping the health information.

The following configuration in Elasticsearch helps in shipping the metric information:

marvel.agent.exporters:
  id1:
    type: http
    host: ["http://:9200"]

Machine Learning:
Used for anomaly detection on the data present in Elasticsearch; the machine learning results can be viewed in Kibana. This plugin can consume data in real time, or data can be submitted as jobs.

More about X-Pack Monitoring :

What is Elasticsearch X-pack monitoring ?

X-Pack monitoring, referred to as Marvel earlier, is used for monitoring Elasticsearch and Kibana. X-Pack monitoring is enabled by default when you install X-Pack ( for older versions of Elasticsearch it can be installed as a plugin manually ). Advanced monitoring settings help us control how frequently data is collected, configure timeouts, and set the retention period for locally stored monitoring indices. We can also adjust how monitoring data is displayed.


How are performance stats collected?

X-Pack monitoring makes use of two components, called collectors and exporters, for this whole job.

Collector: It is a kind of agent that runs once per collection interval ( default 10 seconds ) to obtain data from the Elasticsearch nodes. Once data collection is finished, the data is handed in bulk to the "exporters" to be sent to the monitoring cluster.

Types of collectors: ( Cluster Stats, Index Stats, Shards, etc. ) Each of these collectors collects the respective stats.

Exporter: Used for sending the collected data to the required Elasticsearch cluster nodes.
Types of exporters: ( Local, Remote ) Used for exporting the data to the Elasticsearch monitoring cluster.

In addition, different Beats ( Auditbeat, Filebeat, Heartbeat, Metricbeat, Packetbeat, Winlogbeat ) can also be configured to collect more stats.

Where can we see the performance metrics?

The monitoring metrics captured from each node using the above steps can be viewed in Kibana under the Monitoring section.

Prometheus & Teamcity

Prometheus

Prometheus is a monitoring solution that gathers time-series based numerical data.

Why Prometheus for Teamcity ?

TeamCity provides a variety of diagnostic tools and indicators to monitor and troubleshoot the server, which are accessible from the Administration | Diagnostics page. Along with this, we would like to have Prometheus as a one-stop health lookup for the complete stack of components present in the dev cluster. Therefore, as part of this POC, we investigate methods of getting TeamCity metrics into Prometheus.

How can TeamCity or any third-party application send metrics to Prometheus?

Prometheus refers to this as instrumentation of an application. Any application that wants to expose metrics needs to expose an HTTP(S) endpoint with the URL /metrics. Prometheus will use this endpoint to scrape metrics at regular intervals. The application (or companion process) that exposes the metrics is referred to as an exporter.

Readily available Prometheus Exporters: https://prometheus.io/docs/instrumenting/exporters/

Prometheus Metrics Types: https://prometheus.io/docs/concepts/metric_types/

Teamcity and Prometheus :

TeamCity doesn't provide an exporter for Prometheus; therefore, to pull metrics from TeamCity, we need to write a custom exporter. Exporters can be developed in different languages like Go, Python, Ruby or Java.

Repos explored for a TeamCity custom exporter:

https://github.com/allistera/teamcity-exporter

https://github.com/leominov/teamcity-exporter

https://github.com/Guidewire/teamcity_exporter

https://github.com/m4h/prometheus.

POC: ( Used Python )

As a part of this POC, the following metrics are exposed at an endpoint for Prometheus to scrape.

Teamcity Metrics Exposed:

1) Build queue length over time. ( REST API used – /app/rest/buildQueue )

2) Number of agents over time. ( REST API used – /rest/agents )

3) Projects aggregated status: go through all the projects in TeamCity, get the status of each build for each project, and publish an aggregated build status per project ( SUCCESS=0, RUNNING=1, ERROR=2, FAILURE=3, UNKNOWN=4 )

( Rest API used – app/rest/buildTypes )
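
A minimal sketch of such an exporter using the prometheus_client library is shown below; the TeamCity URL, credentials and the assumption that the REST responses carry a "count" field are placeholders, not the actual POC code.

import time
import requests
from prometheus_client import start_http_server, Gauge

TEAMCITY_URL = 'http://teamcity:8111'            # placeholder TeamCity server
AUTH = ('user', 'password')                      # placeholder credentials
HEADERS = {'Accept': 'application/json'}

build_queue_len = Gauge('teamcity_build_queue_len', 'Number of builds waiting in the queue')
agents_num = Gauge('teamcity_agents_num', 'Number of registered agents')

def collect():
    # Both endpoints are assumed to return a JSON document with a "count" field
    queue = requests.get(TEAMCITY_URL + '/app/rest/buildQueue', auth=AUTH, headers=HEADERS).json()
    agents = requests.get(TEAMCITY_URL + '/app/rest/agents', auth=AUTH, headers=HEADERS).json()
    build_queue_len.set(queue.get('count', 0))
    agents_num.set(agents.get('count', 0))

if __name__ == '__main__':
    start_http_server(9091)                      # exposes /metrics for Prometheus to scrape
    while True:
        collect()
        time.sleep(30)                           # refresh interval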

Prometheus URL:

http://:9090/graph

Use the following metrics and click the Execute button at the above link:

1)teamcity_project_aggregated_status

2)teamcity_agents_num

3)teamcity_build_queue_len

( Attached screen shots for reference )

The TeamCity REST API ( https://confluence.jetbrains.com/display/TCD18/REST+API ) can be explored further to see which other metrics could be useful.

Elasticsearch Performance Tuning Parameters

Following are some of the basic configuration parameters that can impact the performance of an Elasticsearch server.

Field Data:

As queries are issued, aggregations on analyzed strings load the field into fielddata if it wasn't previously loaded. This can consume a lot of heap memory and cause memory pressure. With the setting indices.fielddata.cache.size: 20% in place ( it can be set to a percentage of the heap size or to an absolute value like 2gb ), the least recently used fielddata is evicted to make space for newly loaded data, which keeps fielddata from growing without bound. The following request shows fielddata usage per node: curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?fields=*"

Index Buffer:

Increase the size of the indexing buffer. This setting ( e.g. indices.memory.index_buffer_size: 20% ) determines how full the buffer can get before its documents are written to a segment on disk. The default limits this value to 10 percent of the total heap in order to reserve more of the heap for serving search requests. Increasing it can also help with indexing performance.

Swapping:

Elasticsearch performs poorly when the system is swapping memory; following are some of the options to disable it.

1) Run sudo swapoff -a. To disable swap permanently, edit the /etc/fstab file and comment out any lines that contain the word swap.

2) Edit the Elasticsearch config file and set bootstrap.memory_lock: true. To check that the change has taken effect, use curl -X GET "localhost:9200/_nodes?filter_path=**.mlockall". If you see that mlockall is false, the memory-lock request has failed; in that case set ulimit -l unlimited as root before starting Elasticsearch, or set memlock to unlimited in /etc/security/limits.conf.

Threads:

Elasticsearch uses a number of thread pools for different types of operations. Make sure the number of threads the Elasticsearch user can create is at least 4096. This can be done by setting ulimit -u 4096 as root before starting Elasticsearch, or by setting nproc to 4096 in /etc/security/limits.conf.

Heap memory:

To avoid out-of-memory errors, it is recommended to set the maximum heap size to at most 50% of the available physical RAM. Setting the following parameters in jvm.options helps to increase the heap space, e.g. -Xms10g -Xmx10g. Note: don't cross the 32 GB limit; using -Xmx32g or higher results in the JVM using larger, 64-bit pointers that need more memory.

Virtual memory:

Elasticsearch uses an mmapfs directory by default to store its indices. The default operating system limit on mmap counts is likely to be too low, which may result in out-of-memory exceptions. On Linux, you can increase the limit by running sysctl -w vm.max_map_count=262144 as root. To set this value permanently, update the vm.max_map_count setting in /etc/sysctl.conf. To verify after rebooting, run sysctl vm.max_map_count.

References:

https://www.oreilly.com/ideas/10-elasticsearch-metrics-to-watch

http://www.pipebug.com/elasticsearch-logstash-kibana-4-mapping-2.html

https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale

https://www.datadoghq.com/blog/elasticsearch-performance-scaling-problems/

https://logz.io/blog/the-top-5-elasticsearch-mistakes-how-to-avoid-them/

https://www.ebayinc.com/stories/blogs/tech/elasticsearch-performance-tuning-practice-at-ebay/

https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-request-cache.html