FB_init

Sunday, November 18, 2018

PySpark: references to variable number of columns in UDF


Problem statement:

  Suppose that you want to create a column in a DataFrame based on many existing columns, but you don't know in advance how many columns there will be, perhaps because the list is supplied by the user or by another system.

This is how you can do it:
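A minimal sketch of one way to do it, assuming numeric columns and placeholder names (input_cols, combine): a UDF declared with *args accepts however many columns you unpack into it.

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    # The list of column names is only known at runtime,
    # e.g. supplied by the user or by another system.
    input_cols = ["a", "b", "c"]

    # *args lets a single UDF accept any number of columns.
    @udf(returnType=DoubleType())
    def combine(*values):
        return float(sum(values))

    # Unpack one col() per name; the same call works for 3 columns or 30.
    df = df.withColumn("combined", combine(*[col(c) for c in input_cols]))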

Saturday, November 17, 2018

PySpark: Concatenate two DataFrame columns using UDF


Problem Statement:
  Using PySpark, you have a DataFrame with two columns that hold vectors of floats, and you want to create a new column containing the concatenation of the two.

This is how you can do it:
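A minimal sketch of one way to write this, assuming the two columns hold Python lists of floats and the hypothetical names vec_a and vec_b:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType

    # Return the two float lists joined end to end.
    @udf(returnType=ArrayType(FloatType()))
    def concat_vectors(v1, v2):
        return list(v1) + list(v2)

    df = df.withColumn("vec_ab", concat_vectors(df["vec_a"], df["vec_b"]))

The list(...) calls also cover the case where the columns hold MLlib DenseVectors, since those support the sequence protocol.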

Thursday, November 15, 2018

PySpark, NLP and Pandas UDF


Problem statement:
  Assume that your PySpark DataFrame has a column with text, and that you want to apply NLP to vectorize this text into a new column.

This is how to do it using @pandas_udf.
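A sketch of the shape such a UDF can take, assuming Spark 2.3+, a scalar Pandas UDF, and spaCy's en_core_web_md model (the model name is an assumption; the sentinel tokens come from the explanation below):

    import spacy
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, FloatType

    # Assumes the model is installed wherever the executors run.
    nlp = spacy.load("en_core_web_md")

    @pandas_udf(ArrayType(FloatType()), PandasUDFType.SCALAR)
    def vectorize(s):
        # Fill missing and empty text with sentinel tokens first.
        s = s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_')
        # Doc.vector is the text's vector; convert it to a plain list.
        return s.apply(lambda text: nlp(text).vector.tolist())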



spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that processes the text and produces its vector. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') expression fills in missing data before spaCy sees it.

Now you can create a new column in the DataFrame by calling the function.
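Assuming the text lives in a hypothetical column named text:

    df = df.withColumn("text_vector", vectorize(df["text"]))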


For more information on Pandas UDF see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Tuesday, November 13, 2018

Pandas UDF for PySpark, handling missing data


Problem statement:
  You have a DataFrame and one column has string values, but some values are the empty string. You need to apply the OneHotEncoder, but it doesn't take the empty string.

Solution:
  Use a Pandas UDF to translate the empty strings into another constant string.

First, consider the function to apply the OneHotEncoder:
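A sketch under the Spark 2.x API, where OneHotEncoder is a plain Transformer; the function and column names are placeholders:

    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    def apply_one_hot_encoder(df, input_col, output_col):
        # Map each distinct string to a numeric index first.
        indexer = StringIndexer(inputCol=input_col, outputCol=input_col + "_idx")
        df = indexer.fit(df).transform(df)
        # Expand the index into a sparse one-hot vector.
        encoder = OneHotEncoder(inputCol=input_col + "_idx", outputCol=output_col)
        return encoder.transform(df)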

Now the interesting part. This is the Pandas UDF function:
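Again a sketch: pandas.Series.str.replace with the pattern ^$ touches only empty strings, and the sentinel _NO_VALUE_ is an arbitrary placeholder:

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StringType

    @pandas_udf(StringType(), PandasUDFType.SCALAR)
    def replace_empty(s):
        # ^$ matches the empty string and nothing else.
        return s.str.replace("^$", "_NO_VALUE_", regex=True)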

And now you can create a new column and apply the OneHotEncoder:
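Putting the two pieces together, with hypothetical column names:

    df = df.withColumn("category_clean", replace_empty(df["category"]))
    df = apply_one_hot_encoder(df, "category_clean", "category_vec")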


For more information, see https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html#pandas.Series.str.replace.

This is the exception you get if you don't replace the empty string:

   File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Cannot have an empty string for name.'


Tuesday, September 25, 2018

Pastor Tânia Tereza Carvalho and the rapist God


   I listened to Pastor Tânia Tereza Carvalho for several hours this past weekend here in Ottawa. According to Pastor Tânia, God is an executioner who punishes "sexual sins" in particular. If something is wrong in your life, it is some sin committed by you or by an ancestor. Start by feeling guilty. Then pray Pastor Tânia's prayer to break the curse. But be very careful, because the rapist God is everywhere and sees everything. He is also in your frontal lobe.
   Notwithstanding the fleeting moments of relief, this torment will never end, dear believer. Paul already cried out in despair, "wretched man that I am". As a boy, Saint Augustine stole pears from his neighbour. The weight and guilt of that sin yielded seven chapters of his book Confessions. The tension between body and spirit afflicted the Baroque writers.
   Pastor Tânia had a painful childhood. She was abandoned by her parents and abused. Faced with the absurdity of violence and pain, the response is a second sorrow: irrational justification. After God morally harasses her and abuses her in the most intimate place, namely in her own being, the Just Judge will not even be judged. And the guilt is yours.


Thursday, September 20, 2018

E-Commerce, AI and Societal Challenges - my talk at Data Day 5.0 at Carleton University

This was my talk at Data Day 5.0 at Carleton University on 5/Jun/2018.

  One of the applications of Big Data to e-commerce that we have been working on is a product recommender. While the Universal Recommender is independent of the e-commerce platform, we tested it with Oracle ATG Web Commerce and SAP Hybris Commerce.
  The Universal Recommender models any number of product properties and user properties, as well as user actions in relation to products, such as 'view-product', 'add-to-cart' and 'purchase'.
  Often in e-commerce, only a portion of the product catalogue changes frequently, like price and stock, while new shopper events appear gradually. One of the things we would like to explore is reinforcement learning, trying to model deltas and events for better use of computational resources.
  One problem in e-commerce that may be common to other domains in Machine Learning is that of data modeling. While the Universal Recommender leverages product and user features, the data representation cannot leverage the inherent hierarchical property of product categories, for instance.
  Another challenge with data representation is that of base products versus variants. For example, a shirt can be a base product, but the buying unit, or SKU, may be a large green shirt. And products can also be configurable, like a hamburger.
  I'd like to list other functionality where Big Data and Machine Learning can help e-commerce systems. Search results can be seen as the output of a recommender system. When APIs are exposed on the Internet, hackers can try to steal login information or reward points, or abuse promotions. Machine Learning can be used to detect malicious requests. This problem is not specific to e-commerce: the approach can be used in micro-services in general. Machine Learning can also be used to better detect credit card fraud. Another challenge in e-commerce is the verification of product reviews. Detecting cart abandonment is another application of Machine Learning. In Product Information Management, classification algorithms and NLP can automate product categorization, especially for large catalogues and B2B sites integrating external vendor catalogues. Classification algorithms and NLP with sentiment analysis can help Customer Care with case tagging, case prioritization and dispute resolution.
  A recent development in the area of e-commerce and legislation is the European Union General Data Protection Regulation (GDPR), which is designed to protect the data privacy of all European residents and requires website operators, among other requirements, to consider any applicable notice and consent requirements when collecting personal data from shoppers (for example, using cookies). GDPR not only applies to organizations located within Europe; it also applies to organizations located outside of it if they offer goods or services to European shoppers. It came into effect May 25th of this year and it defines fines for non-compliance. GDPR in my view covers a broader scope of rights and protections than the Personal Information Protection and Electronic Documents Act that we currently have in Canada. Thanks to GDPR, for example, we now know that PayPal shares personal data with more than 600 third-party organizations. Of particular interest to Machine Learning are the rights in relation to automated decision making and profiling. Quoting Mr. Andrew Burt: "The GDPR contain a blanket prohibition on the use of automated decision-making, so long as that decision-making occurs without human intervention and produces significant effects on data subjects." One exception is when the user consents explicitly. One issue is the interpretation of what can be called "rights to explainability", that is, rights to an explanation of the algorithm, its model and its reasoning, considering that ML algorithms and models can be difficult to explain. Another challenge is the right to erasure. Should the ML model be retrained after the user asks to delete her or his data? The Working Party 29, an official European group involved in drafting and interpreting the GDPR, understands that all processing that occurred before the withdrawal remains legal. However, a more critical view could argue that the model is directly derived from the data and can even reveal the data with overfitting in some cases.
  The first international beauty contest judged by AI accepted submissions from more than 6,000 people from more than 100 countries. Out of the 44 winners, nearly all were white women. In the United States, software to predict future criminal activity was found to be biased against African-Americans. In these cases, as in others, societal prejudices made their way into algorithms and models. Let's not forget that software systems are built in certain contexts and deployed into certain contexts. The international beauty contest reveals a naïveté: as if software could reveal a culturally and racially neutral conception of beauty. A coalition of human rights and technology groups at the RightsCon Conference a few weeks ago drafted what is called "The Toronto Declaration". RightsCon covered a broad list of important subjects related to AI and society. The Toronto Declaration emphasizes the risk that Machine Learning systems may "intentionally or inadvertently discriminate against individuals or groups of people". It reads:
"Intentional and inadvertent discriminatory inputs throughout the design, development and  use of machine learning systems create serious risks for human rights; systems are for the most part developed, applied and reviewed by actors which are largely based in particular countries and regions, with limited input from diverse groups in terms of race, culture, gender, and socio-economic backgrounds. This can produce discriminatory results."
Stanford University's One Hundred Year Study on Artificial Intelligence is a long-term investigation of AI and its influence on society. Section 3 of its latest report - "Prospects and Recommendations for Public Policy" - calls to "Increase public and private funding for interdisciplinary studies of the societal impacts of AI. As a society, we are underinvesting resources in research on the societal implications of AI technologies."
  Thinking of societal issues raised by AI and automation, I chose to quickly mention the issue of employment. One good article by Dr. Ewan McGaughey titled "Will Robots Automate Your Job Away? Full Employment, Basic Income, and Economic Democracy" argues that robots and automation are not a primary factor of unemployment. He writes: "once people can see and understand the institutions that shape their lives, and vote in shaping them too, the robots will not automate your job away. There will be full employment, fair incomes, and a thriving economic democracy."
  My best friend is a professor of Machine Learning. He's funny and at times even sarcastic. One day he described to me a hypothetical conversation with a Philosopher. "When I submit applications for grants, I ask for hardware and software for my lab. What do you ask for? A chair?" The philosopher Thomas Kuhn could have replied: 'New science gets accepted, not because of the persuasive force of striking new evidence, but because old scientists die off and young ones replace them.' The endeavour of modeling and interpreting zeros and ones is epistemic and hermeneutic. The time is not for isolation, but for collaboration. More than ever we need to engage the Social Sciences and the Humanities to help us understand what we do, how we do it and why we do it.

Friday, May 11, 2018

Open data - Câmara Legislativa - REST calls and arrays

In the first post (
http://gustavofrederico.blogspot.ca/2016/12/dados-abertos-camara-legislativa.html ) I showed how to use the Câmara's data in a Google spreadsheet.

In this post, I show how to do the same thing, but using a different type of call from the one before, and using arrays. Let's look at what each of these things is.

In the first post, the Google spreadsheet called the Câmara's web service. The Câmara's service returned XML data. In this example, as we will see, we will call a REST service. Both REST and web services are very common types of services in computing.

In this example I also show how to use an array. An array is a list of values. The Câmara's REST service returns data in the "JSON" format, and some of the data is of array type. This array, that is, this list of values, will appear in the spreadsheet as a sequence of cells (a "range").

We start the same way. In the Google spreadsheet, to create a custom formula, go to the "Tools" menu -> "Script editor..."

Create a function called obterDetalheEvento, along the lines of the sketch below.
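A sketch of what such a function can look like in Google Apps Script; the endpoint, query parameters and JSON layout are assumptions based on the API documentation linked below:

    /**
     * Custom formula: fetches the events between two dates from the
     * Câmara's REST API and returns the value of one JSON key per event.
     */
    function obterDetalheEvento(dataInicio, dataFim, chave) {
      var url = 'https://dadosabertos.camara.leg.br/api/v2/eventos'
          + '?dataInicio=' + dataInicio + '&dataFim=' + dataFim;
      var resposta = UrlFetchApp.fetch(url);
      var eventos = JSON.parse(resposta.getContentText()).dados;
      // Returning an array makes Sheets fill one cell per event.
      return eventos.map(function (evento) { return evento[chave]; });
    }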



The function calls a REST service of the Câmara Legislativa that reads event data. See the details of this call in the documentation here:

https://dadosabertos.camara.leg.br/swagger/api.html

The function returns an array (the list of data). See how to use the function below:
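For example, with the start and end dates in cells B5 and B6 and an illustrative key name:

    =obterDetalheEvento(B5, B6, "titulo")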




Since the function returns an array, the Google spreadsheet fills in the cells below the formula with values. The function's arguments include references to cells in the spreadsheet. The last argument (titulo, descricaoSituacao, dataHoraInicio, etc., for example) is a reference to a key in the JSON returned by the Câmara's service. To see what this returned JSON looks like, check the documentation again at https://dadosabertos.camara.leg.br/swagger/api.html .

Just to give you an idea, here is part of the data in JSON format returned by the service:
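The exact payload is not reproduced here; roughly, the shape is as follows, with placeholder values and the keys mentioned above:

    {
      "dados": [
        {
          "id": 12345,
          "dataHoraInicio": "2018-05-02T10:00",
          "descricaoSituacao": "...",
          "titulo": "..."
        }
      ]
    }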



Note on data types:
The obterDetalheEvento formula as demonstrated uses date values as strings (B5 and B6 in the example). It is important to "force" the dates in the cells to be strings, otherwise there will be a type problem in the formula. To "force" the dates as strings, use the single-quote character (') at the beginning of the value in the formula.