FB_init

Tuesday, February 28, 2017

Analyzing request rates using R

Problem statement
A server exposes an API on the Internet, serving requests. Each request originates from a source IP. The question is: how should one configure limits on the web server to avoid abusive requests? 'Abusive requests' are series of requests that arrive in bursts at a high rate per second. Nginx allows one to limit request rates; see http://nginx.org/en/docs/http/ngx_http_limit_req_module.html . Above the threshold, Nginx responds with HTTP error 503.

General approach
The general approach is to load web server log data and measure the request rates from it. After tabulating the data, we can decide what rates are 'abusive'. We'll configure Nginx with 
 limit_req_zone $binary_remote_addr  
meaning that the request rate limit is applied per source IP.

Tools
We'll use some Unix commands to pre-process the log files. If you are using Windows, consider installing Cygwin. See https://cygwin.com/install.html .
For R, we can use either a Jupyter notebook or R Studio. Jupyter was useful to organize the commands in sequence, while R Studio was useful to try ad-hoc commands on the data. Both Jupyter notebooks and R Studio are included in Anaconda Navigator. Begin with the "Miniconda" install. See https://conda.io/miniconda.html .


Install R Studio and Jupyter notebook from Anaconda Navigator.

Pre-processing the web server logs
The log file lines contain many fields. We just need the date+time and the source IP.

We cut twice: first on double quotes, then on whitespace. Then we remove a few leftover characters with sed.


 cut -d "\"" -f1,8 fdpapi_ssl_access_w2_21.log | cut -d " " -f4,6 | sed 's/\[//' | sed 's/,//' | sed 's/\"//' > fdpapi_ssl_access_w2_21.csv  


The end result is lines containing just the date+time and the source IP, for example a line like 19/Feb/2017:00:00:04 100.0.XX.244.



Analyzing the logs

The first lines of the R script load the CSV file. Then we sort the rows by source IP.


 library(readr)  
 # X1 is the request date+time, X2 is the source IP  
 logrows <- read_delim("~/TGIF_work/nginx_thr_FDPCARE-325/fdpapi_ssl_access_w_tutti.csv",   
   " ", escape_double = FALSE, col_names = FALSE,   
   col_types = cols(X1 = col_datetime(format = "%d/%b/%Y:%H:%M:%S")),   
   trim_ws = TRUE)  
 # Sort the rows by source IP  
 logrows <- logrows[order(logrows$X2),]  

You can peek into the log rows.
head(logrows)

X1                    X2
2017-02-19 00:00:04   100.0.XX.244
2017-02-19 00:00:04   100.0.XX.244
2017-02-19 00:03:51   100.1.XXX.223
2017-02-19 00:03:51   100.1.XXX.223
2017-02-19 00:03:52   100.1.XXX.223
2017-02-19 00:02:48   100.1.XXX.60


R has a special type called 'factor' to represent 'categories' in the data. Each unique category in a factor is called a 'level'. In our case, the unique IPs are the levels.


 # Count requests per IP, then order the factor levels by request count, busiest IPs first  
 tb <- table(logrows$X2)   
 ipsfactor <- factor(logrows$X2, levels = names(tb[order(tb, decreasing = TRUE)]))  
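
As an optional sanity check (not part of the original script), one can peek at the first levels of the factor, which should be the busiest IPs:

 # Optional check: the first factor levels should be the IPs with the most requests  
 head(levels(ipsfactor))  
 head(tb[order(tb, decreasing = TRUE)])  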



Let us define a function that translates the request times into numbers (minutes) relative to the first request of each IP. It then builds a histogram of the requests, placing them into one-minute 'bins', and returns the counts.


 # x is a vector with the datetimes for a certain IP  
 findfreq <- function(x) {  
   # minutes elapsed since this IP's first request  
   thesequence <- as.numeric(x - min(x), units = "mins")  
   # count the requests falling into each one-minute bin  
   hist(thesequence, breaks = seq(from = 0, to = ceiling(max(thesequence)) + 1, by = 1), plot = FALSE)$counts  
 }  
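
As a quick illustration, here is a hypothetical spot check of findfreq on the timestamps of a single IP; the IP value is just one of the anonymized samples shown earlier, not a guaranteed match in your data:

 # Hypothetical spot check: per-minute request counts for one sample IP  
 oneip <- logrows$X1[logrows$X2 == "100.1.XXX.223"]  
 findfreq(oneip)  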



Here we could count the per-minute request frequency for each IP by combining the factor and the function, that is, by applying findfreq to the timestamps of each IP. This allows visualizing an 'intermediary' step of the process.


 # Intermediary steps  

 # ipscount <- tapply(logrows$X1, ipsfactor, findfreq, simplify=FALSE)  
 # ttipscount <- t(t(ipscount))  




Transposing the array twice turns the list returned by tapply into a one-column matrix, which prints one IP per row and so simplifies the visualization.
Let's now define another, similar function that only keeps the maximum.


 findmaxfreq <- function(x) {  
   # peak requests per minute for one IP's datetimes  
   max(findfreq(x))  
 }  
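
Continuing the hypothetical spot check from above, the peak per-minute rate for that single IP would be:

 # Peak requests per minute for the single-IP example above  
 findmaxfreq(oneip)  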

Aggregate the data using the new function.


 # One row per IP: the IP and its peak requests per minute  
 ipsagg <- aggregate( logrows$X1 ~ ipsfactor, FUN = findmaxfreq )  

Display the top IPs sorted by peak requests per minute in descending order. These are the 'top abusers'. The table shows that IP 51.XXX.56.29 sent 1160 requests in a single minute at one point.


 head(ipsagg[order(ipsagg$`logrows$X1`, decreasing = TRUE),])  


ipsfactor          logrows$X1
51.XXX.56.29             1160
64.XX.231.130             711
64.XX.231.169             705
68.XX.69.191              500
217.XXX.203.205           462
64.XX.231.157             458

Display quantiles from 99% to 100%, in steps of 0.1%.


 quantile(ipsagg$`logrows$X1`, seq(from=0.99, to=1, by=0.001))  



This table says that 99.6% of the IPs had a maximum request rate of 63 requests per minute or less. Plot the quantiles.


 ipsaggrpm <- ipsagg$`logrows$X1`  
 n <- length(ipsaggrpm)  
 plot((1:n - 1)/(n - 1), sort(ipsaggrpm), type="l",  
      main = "Quantiles for requests per minute",  
      xlab = "Quantile fraction",  
      ylab = "Requests per minute")  
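
As a quick cross-check of the 99.6% figure (a small addition, not in the original script), one can compute the fraction of IPs whose peak rate is at or below the chosen threshold directly:

 # Fraction of IPs whose peak rate is at or below 63 requests per minute  
 mean(ipsaggrpm <= 63)  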


As you can see in the quantiles table and the graph, there is a small percentage of 'abusive' source IPs. 

Web server request limit configuration
From the Nginx documentation: "If the requests rate exceeds the rate configured for a zone, their processing is delayed such that requests are processed at a defined rate. Excessive requests are delayed until their number exceeds the maximum burst size in which case the request is terminated with an error 503 (Service Temporarily Unavailable). By default, the maximum burst size is equal to zero." In this sample case, given that 99.6% of the IPs issued 63 requests per minute or less, and limiting by source IP, we can set the configuration as follows:

  limit_req_zone $binary_remote_addr zone=one:10m rate=63r/m;  
  ...   
  limit_req zone=one burst=21;  

The burst is set to one-third of the 63 requests per minute, i.e. 21. This assumes a single web server. If a load balancer forwards requests evenly to two servers, use roughly half of the values above:


  limit_req_zone $binary_remote_addr zone=one:10m rate=32r/m;  
  ...   
  limit_req zone=one burst=11;  
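
For reference, here is a small R sketch of the arithmetic behind these numbers; the even split across two servers is the scenario described above, and the rounding up is my own choice:

 rate_per_minute <- 63                              # 99.6% of IPs stay at or below this  
 burst_one_server <- ceiling(rate_per_minute / 3)   # one-third of the rate -> 21  
 rate_two_servers <- ceiling(rate_per_minute / 2)   # per-server rate behind the LB -> 32 r/m  
 burst_two_servers <- ceiling(burst_one_server / 2) # per-server burst -> 11  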

Useful links
Matrix Operations in R
Dates and Times in R
Dates and Times Made Easy with lubridate - I didn't have to use lubridate, but I "almost" did.
Quantiles

Download the Jupyter notebook with the R script
You can download the Jupyter notebook with the R script here: https://1drv.ms/u/s!ArIEov4TZ9NM_HBIav7QiNR28Gmu

Saturday, December 10, 2016

Open data - Câmara Legislativa - Proposições

How to build a Google Sheet with open data from the Câmara Legislativa. This example uses a custom formula. This is the final result:

Note the use of a custom formula:



To create a custom formula, go to the menu "Ferramentas" (Tools) -> "Editor de scripts..." (Script editor).


Create a function called obterProposicao, as in the image below. The function calls a Câmara webservice using the UrlFetchApp class ( see https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app ). The webservice returns XML that can be parsed.


Because the XML in this case has a simple, single-level structure, an argument called "elemento" declares what to take from the XML. For details on the Câmara Legislativa webservices, see

http://www2.camara.leg.br/transparencia/dados-abertos/dados-abertos-legislativo

and

http://www2.camara.leg.br/transparencia/dados-abertos/dados-abertos-legislativo/webservices/proposicoes-1/obterproposicao

While the example above works, there is one improvement to make and one open question. One problem is that each reference to the custom formula in the spreadsheet calls the webservice once. Ideally, the number of webservice calls would be reduced, if possible to a single call for the whole spreadsheet, by using cell ranges. The question is: when are the formulas evaluated and the data refreshed? I could not find the answer easily. In a future post I will show improvements to the solution addressing these two points.


Open data - Câmara Legislativa webservices

  I am looking at the Câmara Legislativa webservices. The Câmara has an open data initiative. The requirement is to help some civil society institutions and interested individuals follow in more detail the process and progress of propositions in the Câmara. Imagine that a person can open a spreadsheet, enter basic data about a Proposed Constitutional Amendment (PEC) or a bill, click a menu or enter a formula, and the latest details of the bill or PEC appear in the spreadsheet.

   I start by listing some essential links:

Legislative open data:

http://www2.camara.leg.br/transparencia/dados-abertos/dados-abertos-legislativo

Simplified search:

http://www.camara.leg.br/buscaProposicoesWeb/pesquisaSimplificada

Discussion:

 - The "request" was to use Excel. But since I don't have a license, I took a look at OpenOffice. Main link: http://www.openoffice.org/development/ . I figured it would take me a while to learn how to use the API, and it would also probably be complicated to use OpenOffice in English, or to point other people to the English documentation.

- I preferred to use Google Sheets with Google Apps Script.

Basic links:

https://developers.google.com/apps-script/guides/services/external

https://developers.google.com/apps-script/quickstart/docs

https://developers.google.com/apps-script/guides/sheets

https://developers.google.com/apps-script/quickstart/macros

https://developers.google.com/apps-script/reference/url-fetch/
 

Sunday, July 31, 2016

Installing ActionML PIO + UR - HBase problem


After following instructions to install ActionML's PIO on a single server, I get this error:

[INFO] [Console$] Inspecting PredictionIO...
[INFO] [Console$] PredictionIO 0.9.7-aml is installed at /usr/local/pio-aml
[INFO] [Console$] Inspecting Apache Spark...
[INFO] [Console$] Apache Spark is installed at /usr/local/spark
[INFO] [Console$] Apache Spark 1.6.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Console$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[ERROR] [RecoverableZooKeeper] ZooKeeper exists failed after 1 attempts
[ERROR] [ZooKeeperWatcher] hconnection-0x26fb4d06, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
[WARN] [ZooKeeperRegistry] Can't retrieve clusterId from Zookeeper
[ERROR] [StorageClient] Cannot connect to ZooKeeper (ZooKeeper ensemble: localhost). Please make sure that the configuration is pointing at the correct ZooKeeper ensemble. By default, HBase manages its own ZooKeeper, so if you have not configured HBase to use an external ZooKeeper, that means your HBase is not started or configured properly.
[ERROR] [Storage$] Error initializing storage client for source HBASE
[ERROR] [Console$] Unable to connect to all storage backends successfully. The following shows the error message from the storage backend.
[ERROR] [Console$] Data source HBASE was not properly initialized. (io.prediction.data.storage.StorageClientException)
[ERROR] [Console$] Dumping configuration of initialized storage backend sources. Please make sure they are correct.
[ERROR] [Console$] Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> /usr/local/elasticsearch, HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> some-cluster-name-thinkwrap, TYPE -> elasticsearch
[ERROR] [Console$] Source Name: HBASE; Type: (error); Configuration: (error)
[ERROR] [Console$] Source Name: HDFS; Type: hdfs; Configuration: TYPE -> hdfs, PATH -> hdfs://some-master:9000/models


Friday, July 15, 2016

Installing ActionML PIO + UR - surprise downloading dependency Scala jar


I'm following the upgrade document from http://www.actionml.com/docs/install to get ActionML installed/upgraded ( see previous posts ). When running ./make-distribution.sh I get this error:



[info] [SUCCESSFUL ] org.scala-lang#jline;2.10.4!jline.jar (792ms)
[info] downloading https://repo1.maven.org/maven2/org/fusesource/jansi/jansi/1.4/jansi-1.4.jar ...
[info] [SUCCESSFUL ] org.fusesource.jansi#jansi;1.4!jansi.jar (232ms)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::              FAILED DOWNLOADS            ::
[warn] :: ^ see resolution messages for details  ^ ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.scala-sbt#ivy;0.13.7!ivy.jar
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: download failed: org.scala-sbt#ivy;0.13.7!ivy.jar
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChanne


I have a fast and reliable Internet connection here at the office. Retrying resolved the problem.

ActionML PredictionIO upgrade - tricky details

I'm trying to install ActionML PredictionIO and the Universal Recommender ( see previous posts ). Now I'm following the documentation to upgrade at http://www.actionml.com/docs/install .
Some tricky details:
  - The documentation has a step with sudo su aml that switches to the aml user. Later on it has steps such as rm -r ~/.ivy2, but remember not to run those as the aml user!

- The documentation tells you to run ./make-distribution, but the actual script is ./make-distribution.sh.


PredictionIO: Upgrading or Installing?


  After the "quick install" methods for ActionML's PredictionIO + Universal Recommender failed, someone in the discussion group told me that I must "upgrade". And so I go to
http://www.actionml.com/docs/install

It's not clear whether this page is for installing fresh or for upgrading. The page has an "Upgrade" section followed by an "Install Fresh" section in the middle of the page.